```
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 354 at step 88500 | time: 1345.55s | valid loss 0.74 | bpc 1.07357 | |
---------------------------------------------------------------------------------------------------- | |
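# Note on the columns (an inference from the numbers themselves; the log does
# not define them): bpc is the cross-entropy loss converted from nats to bits,
# bpc = loss / ln(2). For the eval above, 1.07357 * ln(2) ≈ 0.7441, which
# rounds to the printed "valid loss 0.74"; the same relation holds for every
# loss/bpc pair in the per-step lines.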
| epoch 130 step 88510 | 16 batches | lr 0.000442 | ms/batch 11917.46 | loss 0.75 | bpc 1.07888 | |
| epoch 130 step 88520 | 26 batches | lr 0.000442 | ms/batch 5110.18 | loss 0.78 | bpc 1.12858 | |
| epoch 130 step 88530 | 36 batches | lr 0.000442 | ms/batch 5107.78 | loss 0.71 | bpc 1.02528 | |
| epoch 130 step 88540 | 46 batches | lr 0.000442 | ms/batch 5109.07 | loss 0.74 | bpc 1.07031 | |
| epoch 130 step 88550 | 56 batches | lr 0.000442 | ms/batch 5111.60 | loss 0.78 | bpc 1.12227 | |
| epoch 130 step 88560 | 66 batches | lr 0.000442 | ms/batch 5109.90 | loss 0.76 | bpc 1.10196 | |
| epoch 130 step 88570 | 76 batches | lr 0.000442 | ms/batch 5110.75 | loss 0.72 | bpc 1.04134 | |
| epoch 130 step 88580 | 86 batches | lr 0.000442 | ms/batch 5106.93 | loss 0.72 | bpc 1.03947 | |
| epoch 130 step 88590 | 96 batches | lr 0.000442 | ms/batch 5111.03 | loss 0.72 | bpc 1.03796 | |
| epoch 130 step 88600 | 106 batches | lr 0.000442 | ms/batch 5111.09 | loss 0.76 | bpc 1.09314 | |
| epoch 130 step 88610 | 116 batches | lr 0.000442 | ms/batch 5110.12 | loss 0.75 | bpc 1.08070 | |
| epoch 130 step 88620 | 126 batches | lr 0.000442 | ms/batch 5111.19 | loss 0.76 | bpc 1.09950 | |
| epoch 130 step 88630 | 136 batches | lr 0.000442 | ms/batch 5109.70 | loss 0.78 | bpc 1.12416 | |
| epoch 130 step 88640 | 146 batches | lr 0.000442 | ms/batch 5109.59 | loss 0.77 | bpc 1.11730 | |
| epoch 130 step 88650 | 156 batches | lr 0.000442 | ms/batch 5108.91 | loss 0.78 | bpc 1.11911 | |
| epoch 130 step 88660 | 166 batches | lr 0.000442 | ms/batch 5107.52 | loss 0.78 | bpc 1.13135 | |
| epoch 130 step 88670 | 176 batches | lr 0.000442 | ms/batch 5109.05 | loss 0.68 | bpc 0.98543 | |
| epoch 130 step 88680 | 186 batches | lr 0.000442 | ms/batch 5108.64 | loss 0.69 | bpc 0.99098 | |
| epoch 130 step 88690 | 196 batches | lr 0.000442 | ms/batch 5111.96 | loss 0.67 | bpc 0.95956 | |
| epoch 130 step 88700 | 206 batches | lr 0.000442 | ms/batch 5109.56 | loss 0.65 | bpc 0.94402 | |
| epoch 130 step 88710 | 216 batches | lr 0.000442 | ms/batch 5109.99 | loss 0.76 | bpc 1.08980 | |
| epoch 130 step 88720 | 226 batches | lr 0.000442 | ms/batch 5110.06 | loss 0.74 | bpc 1.07421 | |
| epoch 130 step 88730 | 236 batches | lr 0.000442 | ms/batch 5109.98 | loss 0.74 | bpc 1.06617 | |
| epoch 130 step 88740 | 246 batches | lr 0.000442 | ms/batch 5111.55 | loss 0.78 | bpc 1.12912 | |
| epoch 130 step 88750 | 256 batches | lr 0.000442 | ms/batch 5109.93 | loss 0.75 | bpc 1.08629 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 355 at step 88750 | time: 1345.67s | valid loss 0.75 | bpc 1.07538 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 130 step 88760 | 266 batches | lr 0.000442 | ms/batch 11920.42 | loss 0.71 | bpc 1.01712 | |
| epoch 130 step 88770 | 276 batches | lr 0.000442 | ms/batch 5116.65 | loss 0.75 | bpc 1.07615 | |
| epoch 130 step 88780 | 286 batches | lr 0.000442 | ms/batch 5110.19 | loss 0.79 | bpc 1.13928 | |
| epoch 130 step 88790 | 296 batches | lr 0.000442 | ms/batch 5109.88 | loss 0.75 | bpc 1.07881 | |
| epoch 130 step 88800 | 306 batches | lr 0.000442 | ms/batch 5110.47 | loss 0.75 | bpc 1.08729 | |
| epoch 130 step 88810 | 316 batches | lr 0.000442 | ms/batch 5117.64 | loss 0.78 | bpc 1.12217 | |
| epoch 130 step 88820 | 326 batches | lr 0.000442 | ms/batch 5113.82 | loss 0.75 | bpc 1.08376 | |
| epoch 130 step 88830 | 336 batches | lr 0.000442 | ms/batch 5110.18 | loss 0.75 | bpc 1.08456 | |
| epoch 130 step 88840 | 346 batches | lr 0.000442 | ms/batch 5108.92 | loss 0.72 | bpc 1.04117 | |
| epoch 130 step 88850 | 356 batches | lr 0.000442 | ms/batch 5110.31 | loss 0.78 | bpc 1.13044 | |
| epoch 130 step 88860 | 366 batches | lr 0.000442 | ms/batch 5112.23 | loss 0.74 | bpc 1.07128 | |
| epoch 130 step 88870 | 376 batches | lr 0.000442 | ms/batch 5108.59 | loss 0.73 | bpc 1.06027 | |
| epoch 130 step 88880 | 386 batches | lr 0.000442 | ms/batch 5111.86 | loss 0.77 | bpc 1.11659 | |
| epoch 130 step 88890 | 396 batches | lr 0.000442 | ms/batch 5110.05 | loss 0.76 | bpc 1.10069 | |
| epoch 130 step 88900 | 406 batches | lr 0.000441 | ms/batch 5109.84 | loss 0.77 | bpc 1.11302 | |
| epoch 130 step 88910 | 416 batches | lr 0.000441 | ms/batch 5107.88 | loss 0.75 | bpc 1.08743 | |
| epoch 130 step 88920 | 426 batches | lr 0.000441 | ms/batch 5108.41 | loss 0.72 | bpc 1.03669 | |
| epoch 130 step 88930 | 436 batches | lr 0.000441 | ms/batch 5106.52 | loss 0.78 | bpc 1.13148 | |
| epoch 130 step 88940 | 446 batches | lr 0.000441 | ms/batch 5106.56 | loss 0.76 | bpc 1.09944 | |
| epoch 130 step 88950 | 456 batches | lr 0.000441 | ms/batch 5111.91 | loss 0.76 | bpc 1.09645 | |
| epoch 130 step 88960 | 466 batches | lr 0.000441 | ms/batch 5108.37 | loss 0.75 | bpc 1.08181 | |
| epoch 130 step 88970 | 476 batches | lr 0.000441 | ms/batch 5111.26 | loss 0.74 | bpc 1.07304 | |
| epoch 130 step 88980 | 486 batches | lr 0.000441 | ms/batch 5112.44 | loss 0.76 | bpc 1.09970 | |
| epoch 130 step 88990 | 496 batches | lr 0.000441 | ms/batch 5110.46 | loss 0.76 | bpc 1.08984 | |
| epoch 130 step 89000 | 506 batches | lr 0.000441 | ms/batch 5109.07 | loss 0.74 | bpc 1.06597 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 356 at step 89000 | time: 1345.80s | valid loss 0.74 | bpc 1.07352 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 130 step 89010 | 516 batches | lr 0.000441 | ms/batch 11921.14 | loss 0.76 | bpc 1.08943 | |
| epoch 130 step 89020 | 526 batches | lr 0.000441 | ms/batch 5110.10 | loss 0.76 | bpc 1.10151 | |
| epoch 130 step 89030 | 536 batches | lr 0.000441 | ms/batch 5110.45 | loss 0.79 | bpc 1.14166 | |
| epoch 130 step 89040 | 546 batches | lr 0.000441 | ms/batch 5107.80 | loss 0.74 | bpc 1.07298 | |
| epoch 130 step 89050 | 556 batches | lr 0.000441 | ms/batch 5110.10 | loss 0.77 | bpc 1.11248 | |
| epoch 130 step 89060 | 566 batches | lr 0.000441 | ms/batch 5110.26 | loss 0.78 | bpc 1.12557 | |
| epoch 130 step 89070 | 576 batches | lr 0.000441 | ms/batch 5108.65 | loss 0.74 | bpc 1.06934 | |
| epoch 130 step 89080 | 586 batches | lr 0.000441 | ms/batch 5109.55 | loss 0.75 | bpc 1.07815 | |
| epoch 130 step 89090 | 596 batches | lr 0.000441 | ms/batch 5110.68 | loss 0.77 | bpc 1.11480 | |
| epoch 130 step 89100 | 606 batches | lr 0.000441 | ms/batch 5108.65 | loss 0.74 | bpc 1.06829 | |
| epoch 130 step 89110 | 616 batches | lr 0.000441 | ms/batch 5111.61 | loss 0.74 | bpc 1.06888 | |
| epoch 130 step 89120 | 626 batches | lr 0.000441 | ms/batch 5109.05 | loss 0.76 | bpc 1.09776 | |
| epoch 130 step 89130 | 636 batches | lr 0.000441 | ms/batch 5109.98 | loss 0.77 | bpc 1.11165 | |
| epoch 130 step 89140 | 646 batches | lr 0.000441 | ms/batch 5111.07 | loss 0.76 | bpc 1.10206 | |
| epoch 130 step 89150 | 656 batches | lr 0.000441 | ms/batch 5111.23 | loss 0.79 | bpc 1.14227 | |
| epoch 130 step 89160 | 666 batches | lr 0.000441 | ms/batch 5109.62 | loss 0.78 | bpc 1.13069 | |
| epoch 130 step 89170 | 676 batches | lr 0.000441 | ms/batch 5109.70 | loss 0.70 | bpc 1.01404 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 130 step 89180 | 686 batches | lr 0.000441 | ms/batch 5109.98 | loss 0.70 | bpc 1.00632 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
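# The "dropping batch" lines are emitted by the data pipeline: the corpus is
# served in fixed [seq_len, bsz] chunks, and any trailing chunk shorter than
# seq_len=16384 is skipped (the 2407-token tail right before each eval, the
# 10574/10575-token tails at each epoch boundary). A minimal sketch of such a
# guard, assuming a PyTorch-style iterator with hypothetical names (the
# training script itself is not part of this gist):
#
#     seq_len = 16384
#     for data, target in corpus_iter:        # data: LongTensor [chunk_len, bsz]
#         if data.size(0) != seq_len:         # trailing partial chunk
#             print(f'dropping batch of size {data.size()} '
#                   f'because not full seq_len of {seq_len}')
#             continue
#         train_step(data, target)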
| epoch 131 step 89190 | 10 batches | lr 0.000441 | ms/batch 5109.06 | loss 0.73 | bpc 1.05705 | |
| epoch 131 step 89200 | 20 batches | lr 0.000441 | ms/batch 5111.14 | loss 0.76 | bpc 1.10275 | |
| epoch 131 step 89210 | 30 batches | lr 0.000441 | ms/batch 5109.35 | loss 0.76 | bpc 1.09028 | |
| epoch 131 step 89220 | 40 batches | lr 0.000441 | ms/batch 5108.13 | loss 0.72 | bpc 1.03795 | |
| epoch 131 step 89230 | 50 batches | lr 0.000441 | ms/batch 5108.15 | loss 0.76 | bpc 1.09844 | |
| epoch 131 step 89240 | 60 batches | lr 0.000441 | ms/batch 5108.84 | loss 0.77 | bpc 1.11221 | |
| epoch 131 step 89250 | 70 batches | lr 0.000441 | ms/batch 5106.78 | loss 0.75 | bpc 1.07824 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 357 at step 89250 | time: 1345.60s | valid loss 0.75 | bpc 1.07814 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 131 step 89260 | 80 batches | lr 0.000441 | ms/batch 11920.11 | loss 0.71 | bpc 1.02346 | |
| epoch 131 step 89270 | 90 batches | lr 0.000441 | ms/batch 5111.27 | loss 0.74 | bpc 1.06365 | |
| epoch 131 step 89280 | 100 batches | lr 0.000441 | ms/batch 5108.17 | loss 0.71 | bpc 1.02649 | |
| epoch 131 step 89290 | 110 batches | lr 0.000441 | ms/batch 5111.37 | loss 0.77 | bpc 1.10952 | |
| epoch 131 step 89300 | 120 batches | lr 0.000441 | ms/batch 5106.71 | loss 0.74 | bpc 1.06355 | |
| epoch 131 step 89310 | 130 batches | lr 0.000441 | ms/batch 5109.98 | loss 0.78 | bpc 1.12773 | |
| epoch 131 step 89320 | 140 batches | lr 0.000441 | ms/batch 5110.65 | loss 0.80 | bpc 1.14948 | |
| epoch 131 step 89330 | 150 batches | lr 0.000441 | ms/batch 5107.86 | loss 0.76 | bpc 1.08944 | |
| epoch 131 step 89340 | 160 batches | lr 0.000441 | ms/batch 5111.13 | loss 0.79 | bpc 1.13985 | |
| epoch 131 step 89350 | 170 batches | lr 0.000441 | ms/batch 5112.23 | loss 0.74 | bpc 1.06195 | |
| epoch 131 step 89360 | 180 batches | lr 0.000441 | ms/batch 5112.82 | loss 0.68 | bpc 0.97906 | |
| epoch 131 step 89370 | 190 batches | lr 0.000441 | ms/batch 5111.72 | loss 0.68 | bpc 0.97975 | |
| epoch 131 step 89380 | 200 batches | lr 0.000441 | ms/batch 5107.57 | loss 0.64 | bpc 0.91855 | |
| epoch 131 step 89390 | 210 batches | lr 0.000441 | ms/batch 5107.89 | loss 0.70 | bpc 1.01642 | |
| epoch 131 step 89400 | 220 batches | lr 0.000441 | ms/batch 5109.97 | loss 0.77 | bpc 1.11069 | |
| epoch 131 step 89410 | 230 batches | lr 0.000441 | ms/batch 5113.63 | loss 0.74 | bpc 1.07091 | |
| epoch 131 step 89420 | 240 batches | lr 0.000441 | ms/batch 5109.81 | loss 0.74 | bpc 1.07293 | |
| epoch 131 step 89430 | 250 batches | lr 0.000441 | ms/batch 5113.48 | loss 0.79 | bpc 1.13605 | |
| epoch 131 step 89440 | 260 batches | lr 0.000441 | ms/batch 5110.54 | loss 0.73 | bpc 1.04907 | |
| epoch 131 step 89450 | 270 batches | lr 0.000441 | ms/batch 5109.24 | loss 0.71 | bpc 1.02469 | |
| epoch 131 step 89460 | 280 batches | lr 0.000441 | ms/batch 5109.88 | loss 0.76 | bpc 1.08932 | |
| epoch 131 step 89470 | 290 batches | lr 0.000441 | ms/batch 5109.57 | loss 0.78 | bpc 1.12607 | |
| epoch 131 step 89480 | 300 batches | lr 0.000441 | ms/batch 5108.89 | loss 0.75 | bpc 1.08896 | |
| epoch 131 step 89490 | 310 batches | lr 0.000441 | ms/batch 5110.77 | loss 0.78 | bpc 1.11997 | |
| epoch 131 step 89500 | 320 batches | lr 0.000441 | ms/batch 5109.16 | loss 0.74 | bpc 1.06922 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 358 at step 89500 | time: 1349.56s | valid loss 0.74 | bpc 1.07336 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 131 step 89510 | 330 batches | lr 0.000441 | ms/batch 11920.62 | loss 0.76 | bpc 1.10263 | |
| epoch 131 step 89520 | 340 batches | lr 0.000441 | ms/batch 5111.67 | loss 0.73 | bpc 1.04803 | |
| epoch 131 step 89530 | 350 batches | lr 0.000441 | ms/batch 5110.09 | loss 0.75 | bpc 1.07544 | |
| epoch 131 step 89540 | 360 batches | lr 0.000441 | ms/batch 5110.72 | loss 0.79 | bpc 1.14284 | |
| epoch 131 step 89550 | 370 batches | lr 0.000441 | ms/batch 5109.86 | loss 0.73 | bpc 1.05921 | |
| epoch 131 step 89560 | 380 batches | lr 0.000441 | ms/batch 5108.25 | loss 0.75 | bpc 1.08468 | |
| epoch 131 step 89570 | 390 batches | lr 0.000441 | ms/batch 5108.25 | loss 0.75 | bpc 1.08260 | |
| epoch 131 step 89580 | 400 batches | lr 0.000441 | ms/batch 5109.65 | loss 0.79 | bpc 1.14525 | |
| epoch 131 step 89590 | 410 batches | lr 0.000441 | ms/batch 5107.64 | loss 0.76 | bpc 1.09968 | |
| epoch 131 step 89600 | 420 batches | lr 0.000441 | ms/batch 5112.26 | loss 0.73 | bpc 1.05447 | |
| epoch 131 step 89610 | 430 batches | lr 0.000441 | ms/batch 5107.44 | loss 0.74 | bpc 1.06818 | |
| epoch 131 step 89620 | 440 batches | lr 0.000441 | ms/batch 5108.20 | loss 0.77 | bpc 1.11727 | |
| epoch 131 step 89630 | 450 batches | lr 0.000441 | ms/batch 5109.88 | loss 0.76 | bpc 1.10264 | |
| epoch 131 step 89640 | 460 batches | lr 0.000441 | ms/batch 5109.29 | loss 0.76 | bpc 1.09631 | |
| epoch 131 step 89650 | 470 batches | lr 0.000441 | ms/batch 5108.21 | loss 0.73 | bpc 1.05995 | |
| epoch 131 step 89660 | 480 batches | lr 0.000441 | ms/batch 5108.40 | loss 0.77 | bpc 1.10835 | |
| epoch 131 step 89670 | 490 batches | lr 0.000441 | ms/batch 5109.71 | loss 0.75 | bpc 1.07987 | |
| epoch 131 step 89680 | 500 batches | lr 0.000441 | ms/batch 5108.73 | loss 0.74 | bpc 1.07462 | |
| epoch 131 step 89690 | 510 batches | lr 0.00044 | ms/batch 5109.62 | loss 0.75 | bpc 1.08154 | |
| epoch 131 step 89700 | 520 batches | lr 0.00044 | ms/batch 5110.14 | loss 0.76 | bpc 1.09554 | |
| epoch 131 step 89710 | 530 batches | lr 0.00044 | ms/batch 5113.81 | loss 0.78 | bpc 1.11824 | |
| epoch 131 step 89720 | 540 batches | lr 0.00044 | ms/batch 5107.65 | loss 0.78 | bpc 1.13175 | |
| epoch 131 step 89730 | 550 batches | lr 0.00044 | ms/batch 5108.45 | loss 0.74 | bpc 1.07103 | |
| epoch 131 step 89740 | 560 batches | lr 0.00044 | ms/batch 5112.42 | loss 0.77 | bpc 1.11572 | |
| epoch 131 step 89750 | 570 batches | lr 0.00044 | ms/batch 5112.12 | loss 0.76 | bpc 1.10327 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 359 at step 89750 | time: 1345.59s | valid loss 0.75 | bpc 1.07808 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 131 step 89760 | 580 batches | lr 0.00044 | ms/batch 11917.89 | loss 0.74 | bpc 1.07099 | |
| epoch 131 step 89770 | 590 batches | lr 0.00044 | ms/batch 5109.72 | loss 0.76 | bpc 1.08982 | |
| epoch 131 step 89780 | 600 batches | lr 0.00044 | ms/batch 5110.57 | loss 0.76 | bpc 1.09137 | |
| epoch 131 step 89790 | 610 batches | lr 0.00044 | ms/batch 5110.66 | loss 0.73 | bpc 1.06013 | |
| epoch 131 step 89800 | 620 batches | lr 0.00044 | ms/batch 5109.59 | loss 0.76 | bpc 1.09484 | |
| epoch 131 step 89810 | 630 batches | lr 0.00044 | ms/batch 5110.00 | loss 0.77 | bpc 1.11182 | |
| epoch 131 step 89820 | 640 batches | lr 0.00044 | ms/batch 5111.86 | loss 0.76 | bpc 1.09106 | |
| epoch 131 step 89830 | 650 batches | lr 0.00044 | ms/batch 5106.51 | loss 0.80 | bpc 1.14809 | |
| epoch 131 step 89840 | 660 batches | lr 0.00044 | ms/batch 5111.10 | loss 0.78 | bpc 1.11960 | |
| epoch 131 step 89850 | 670 batches | lr 0.00044 | ms/batch 5109.66 | loss 0.75 | bpc 1.08794 | |
| epoch 131 step 89860 | 680 batches | lr 0.00044 | ms/batch 5114.22 | loss 0.71 | bpc 1.01771 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 132 step 89870 | 4 batches | lr 0.00044 | ms/batch 5112.71 | loss 0.70 | bpc 1.01284 | |
| epoch 132 step 89880 | 14 batches | lr 0.00044 | ms/batch 5112.81 | loss 0.74 | bpc 1.07386 | |
| epoch 132 step 89890 | 24 batches | lr 0.00044 | ms/batch 5109.00 | loss 0.77 | bpc 1.11671 | |
| epoch 132 step 89900 | 34 batches | lr 0.00044 | ms/batch 5108.15 | loss 0.73 | bpc 1.05401 | |
| epoch 132 step 89910 | 44 batches | lr 0.00044 | ms/batch 5108.40 | loss 0.73 | bpc 1.05036 | |
| epoch 132 step 89920 | 54 batches | lr 0.00044 | ms/batch 5111.73 | loss 0.77 | bpc 1.11514 | |
| epoch 132 step 89930 | 64 batches | lr 0.00044 | ms/batch 5111.25 | loss 0.76 | bpc 1.09007 | |
| epoch 132 step 89940 | 74 batches | lr 0.00044 | ms/batch 5112.55 | loss 0.74 | bpc 1.07113 | |
| epoch 132 step 89950 | 84 batches | lr 0.00044 | ms/batch 5108.78 | loss 0.71 | bpc 1.01992 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
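# The "Gradient overflow" lines are dynamic loss scaling at work; this exact
# wording is printed by NVIDIA Apex AMP's dynamic loss scaler. When fp16
# gradients contain inf/NaN, the update is skipped and the scale is multiplied
# by 0.5 (here down to 524288.0 = 2**19); the four identical lines are
# consistent with the same overflow being reported once per worker. A minimal
# sketch of the mechanism, assuming an Apex-style loop (not shown in this
# gist):
#
#     (loss * loss_scale).backward()          # scale up to keep small fp16 grads
#     grads = [p.grad for p in model.parameters() if p.grad is not None]
#     if any(torch.isinf(g).any() or torch.isnan(g).any() for g in grads):
#         loss_scale *= 0.5                   # back off after overflow
#         optimizer.zero_grad()               # skip this step entirely
#     else:
#         for g in grads:
#             g.div_(loss_scale)              # unscale before the optimizer step
#         optimizer.step()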
| epoch 132 step 89960 | 94 batches | lr 0.00044 | ms/batch 5110.21 | loss 0.72 | bpc 1.04388 | |
| epoch 132 step 89970 | 104 batches | lr 0.00044 | ms/batch 5108.28 | loss 0.75 | bpc 1.07957 | |
| epoch 132 step 89980 | 114 batches | lr 0.00044 | ms/batch 5112.02 | loss 0.75 | bpc 1.08257 | |
| epoch 132 step 89990 | 124 batches | lr 0.00044 | ms/batch 5111.27 | loss 0.76 | bpc 1.09971 | |
| epoch 132 step 90000 | 134 batches | lr 0.00044 | ms/batch 5111.29 | loss 0.76 | bpc 1.10265 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 360 at step 90000 | time: 1345.86s | valid loss 0.74 | bpc 1.07125 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 132 step 90010 | 144 batches | lr 0.00044 | ms/batch 11918.63 | loss 0.79 | bpc 1.13840 | |
| epoch 132 step 90020 | 154 batches | lr 0.00044 | ms/batch 5111.13 | loss 0.77 | bpc 1.10728 | |
| epoch 132 step 90030 | 164 batches | lr 0.00044 | ms/batch 5108.84 | loss 0.79 | bpc 1.14190 | |
| epoch 132 step 90040 | 174 batches | lr 0.00044 | ms/batch 5104.21 | loss 0.69 | bpc 0.98871 | |
| epoch 132 step 90050 | 184 batches | lr 0.00044 | ms/batch 5109.38 | loss 0.69 | bpc 0.99003 | |
| epoch 132 step 90060 | 194 batches | lr 0.00044 | ms/batch 5108.63 | loss 0.69 | bpc 0.99983 | |
| epoch 132 step 90070 | 204 batches | lr 0.00044 | ms/batch 5113.68 | loss 0.62 | bpc 0.89916 | |
| epoch 132 step 90080 | 214 batches | lr 0.00044 | ms/batch 5107.05 | loss 0.73 | bpc 1.05723 | |
| epoch 132 step 90090 | 224 batches | lr 0.00044 | ms/batch 5110.52 | loss 0.75 | bpc 1.07939 | |
| epoch 132 step 90100 | 234 batches | lr 0.00044 | ms/batch 5107.37 | loss 0.74 | bpc 1.07226 | |
| epoch 132 step 90110 | 244 batches | lr 0.00044 | ms/batch 5111.16 | loss 0.76 | bpc 1.09685 | |
| epoch 132 step 90120 | 254 batches | lr 0.00044 | ms/batch 5109.74 | loss 0.78 | bpc 1.12795 | |
| epoch 132 step 90130 | 264 batches | lr 0.00044 | ms/batch 5110.06 | loss 0.71 | bpc 1.01871 | |
| epoch 132 step 90140 | 274 batches | lr 0.00044 | ms/batch 5109.58 | loss 0.73 | bpc 1.04763 | |
| epoch 132 step 90150 | 284 batches | lr 0.00044 | ms/batch 5109.58 | loss 0.77 | bpc 1.11597 | |
| epoch 132 step 90160 | 294 batches | lr 0.00044 | ms/batch 5106.87 | loss 0.76 | bpc 1.10210 | |
| epoch 132 step 90170 | 304 batches | lr 0.00044 | ms/batch 5110.62 | loss 0.76 | bpc 1.09758 | |
| epoch 132 step 90180 | 314 batches | lr 0.00044 | ms/batch 5113.50 | loss 0.77 | bpc 1.11179 | |
| epoch 132 step 90190 | 324 batches | lr 0.00044 | ms/batch 5113.21 | loss 0.75 | bpc 1.08505 | |
| epoch 132 step 90200 | 334 batches | lr 0.00044 | ms/batch 5110.91 | loss 0.76 | bpc 1.09656 | |
| epoch 132 step 90210 | 344 batches | lr 0.00044 | ms/batch 5110.33 | loss 0.71 | bpc 1.02596 | |
| epoch 132 step 90220 | 354 batches | lr 0.00044 | ms/batch 5109.63 | loss 0.78 | bpc 1.12958 | |
| epoch 132 step 90230 | 364 batches | lr 0.00044 | ms/batch 5109.47 | loss 0.75 | bpc 1.08200 | |
| epoch 132 step 90240 | 374 batches | lr 0.00044 | ms/batch 5109.00 | loss 0.73 | bpc 1.04912 | |
| epoch 132 step 90250 | 384 batches | lr 0.00044 | ms/batch 5110.77 | loss 0.76 | bpc 1.10366 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 361 at step 90250 | time: 1345.61s | valid loss 0.75 | bpc 1.07563 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 132 step 90260 | 394 batches | lr 0.00044 | ms/batch 11914.05 | loss 0.76 | bpc 1.09406 | |
| epoch 132 step 90270 | 404 batches | lr 0.00044 | ms/batch 5110.51 | loss 0.77 | bpc 1.11756 | |
| epoch 132 step 90280 | 414 batches | lr 0.00044 | ms/batch 5109.65 | loss 0.76 | bpc 1.09405 | |
| epoch 132 step 90290 | 424 batches | lr 0.00044 | ms/batch 5108.36 | loss 0.73 | bpc 1.04742 | |
| epoch 132 step 90300 | 434 batches | lr 0.00044 | ms/batch 5108.32 | loss 0.77 | bpc 1.10474 | |
| epoch 132 step 90310 | 444 batches | lr 0.00044 | ms/batch 5105.16 | loss 0.77 | bpc 1.10366 | |
| epoch 132 step 90320 | 454 batches | lr 0.00044 | ms/batch 5111.63 | loss 0.77 | bpc 1.11278 | |
| epoch 132 step 90330 | 464 batches | lr 0.00044 | ms/batch 5108.50 | loss 0.75 | bpc 1.08353 | |
| epoch 132 step 90340 | 474 batches | lr 0.00044 | ms/batch 5111.40 | loss 0.74 | bpc 1.06543 | |
| epoch 132 step 90350 | 484 batches | lr 0.00044 | ms/batch 5110.23 | loss 0.76 | bpc 1.09448 | |
| epoch 132 step 90360 | 494 batches | lr 0.00044 | ms/batch 5111.32 | loss 0.76 | bpc 1.09820 | |
| epoch 132 step 90370 | 504 batches | lr 0.00044 | ms/batch 5110.38 | loss 0.74 | bpc 1.06161 | |
| epoch 132 step 90380 | 514 batches | lr 0.00044 | ms/batch 5110.22 | loss 0.74 | bpc 1.07323 | |
| epoch 132 step 90390 | 524 batches | lr 0.00044 | ms/batch 5112.17 | loss 0.75 | bpc 1.08921 | |
| epoch 132 step 90400 | 534 batches | lr 0.00044 | ms/batch 5109.81 | loss 0.80 | bpc 1.15589 | |
| epoch 132 step 90410 | 544 batches | lr 0.00044 | ms/batch 5109.77 | loss 0.75 | bpc 1.07926 | |
| epoch 132 step 90420 | 554 batches | lr 0.00044 | ms/batch 5109.18 | loss 0.76 | bpc 1.10255 | |
| epoch 132 step 90430 | 564 batches | lr 0.00044 | ms/batch 5109.46 | loss 0.78 | bpc 1.11963 | |
| epoch 132 step 90440 | 574 batches | lr 0.00044 | ms/batch 5108.64 | loss 0.75 | bpc 1.07886 | |
| epoch 132 step 90450 | 584 batches | lr 0.00044 | ms/batch 5108.96 | loss 0.74 | bpc 1.06800 | |
| epoch 132 step 90460 | 594 batches | lr 0.00044 | ms/batch 5109.54 | loss 0.77 | bpc 1.11170 | |
| epoch 132 step 90470 | 604 batches | lr 0.00044 | ms/batch 5105.91 | loss 0.75 | bpc 1.08418 | |
| epoch 132 step 90480 | 614 batches | lr 0.000439 | ms/batch 5108.94 | loss 0.73 | bpc 1.05024 | |
| epoch 132 step 90490 | 624 batches | lr 0.000439 | ms/batch 5106.03 | loss 0.76 | bpc 1.09936 | |
| epoch 132 step 90500 | 634 batches | lr 0.000439 | ms/batch 5108.51 | loss 0.77 | bpc 1.11459 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 362 at step 90500 | time: 1345.44s | valid loss 0.75 | bpc 1.07697 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 132 step 90510 | 644 batches | lr 0.000439 | ms/batch 11917.64 | loss 0.75 | bpc 1.08751 | |
| epoch 132 step 90520 | 654 batches | lr 0.000439 | ms/batch 5108.35 | loss 0.81 | bpc 1.16146 | |
| epoch 132 step 90530 | 664 batches | lr 0.000439 | ms/batch 5110.55 | loss 0.78 | bpc 1.12155 | |
| epoch 132 step 90540 | 674 batches | lr 0.000439 | ms/batch 5109.67 | loss 0.71 | bpc 1.02879 | |
| epoch 132 step 90550 | 684 batches | lr 0.000439 | ms/batch 5107.41 | loss 0.70 | bpc 1.01536 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 133 step 90560 | 8 batches | lr 0.000439 | ms/batch 5109.46 | loss 0.72 | bpc 1.04578 | |
| epoch 133 step 90570 | 18 batches | lr 0.000439 | ms/batch 5111.35 | loss 0.75 | bpc 1.08669 | |
| epoch 133 step 90580 | 28 batches | lr 0.000439 | ms/batch 5110.25 | loss 0.77 | bpc 1.10752 | |
| epoch 133 step 90590 | 38 batches | lr 0.000439 | ms/batch 5109.34 | loss 0.72 | bpc 1.03898 | |
| epoch 133 step 90600 | 48 batches | lr 0.000439 | ms/batch 5110.65 | loss 0.74 | bpc 1.06887 | |
| epoch 133 step 90610 | 58 batches | lr 0.000439 | ms/batch 5110.19 | loss 0.77 | bpc 1.11426 | |
| epoch 133 step 90620 | 68 batches | lr 0.000439 | ms/batch 5109.88 | loss 0.76 | bpc 1.09651 | |
| epoch 133 step 90630 | 78 batches | lr 0.000439 | ms/batch 5108.22 | loss 0.71 | bpc 1.02904 | |
| epoch 133 step 90640 | 88 batches | lr 0.000439 | ms/batch 5112.59 | loss 0.73 | bpc 1.04620 | |
| epoch 133 step 90650 | 98 batches | lr 0.000439 | ms/batch 5108.71 | loss 0.71 | bpc 1.02830 | |
| epoch 133 step 90660 | 108 batches | lr 0.000439 | ms/batch 5109.81 | loss 0.77 | bpc 1.11130 | |
| epoch 133 step 90670 | 118 batches | lr 0.000439 | ms/batch 5110.42 | loss 0.74 | bpc 1.07122 | |
| epoch 133 step 90680 | 128 batches | lr 0.000439 | ms/batch 5111.45 | loss 0.78 | bpc 1.12016 | |
| epoch 133 step 90690 | 138 batches | lr 0.000439 | ms/batch 5110.80 | loss 0.79 | bpc 1.13280 | |
| epoch 133 step 90700 | 148 batches | lr 0.000439 | ms/batch 5110.50 | loss 0.75 | bpc 1.07933 | |
| epoch 133 step 90710 | 158 batches | lr 0.000439 | ms/batch 5110.77 | loss 0.79 | bpc 1.13885 | |
| epoch 133 step 90720 | 168 batches | lr 0.000439 | ms/batch 5111.67 | loss 0.76 | bpc 1.09247 | |
| epoch 133 step 90730 | 178 batches | lr 0.000439 | ms/batch 5111.43 | loss 0.68 | bpc 0.97713 | |
| epoch 133 step 90740 | 188 batches | lr 0.000439 | ms/batch 5107.81 | loss 0.68 | bpc 0.98147 | |
| epoch 133 step 90750 | 198 batches | lr 0.000439 | ms/batch 5108.37 | loss 0.65 | bpc 0.94320 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 363 at step 90750 | time: 1345.63s | valid loss 0.74 | bpc 1.07103 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 133 step 90760 | 208 batches | lr 0.000439 | ms/batch 11911.55 | loss 0.67 | bpc 0.97205 | |
| epoch 133 step 90770 | 218 batches | lr 0.000439 | ms/batch 5107.83 | loss 0.76 | bpc 1.09557 | |
| epoch 133 step 90780 | 228 batches | lr 0.000439 | ms/batch 5111.95 | loss 0.75 | bpc 1.07634 | |
| epoch 133 step 90790 | 238 batches | lr 0.000439 | ms/batch 5109.72 | loss 0.74 | bpc 1.06581 | |
| epoch 133 step 90800 | 248 batches | lr 0.000439 | ms/batch 5104.38 | loss 0.78 | bpc 1.12792 | |
| epoch 133 step 90810 | 258 batches | lr 0.000439 | ms/batch 5109.96 | loss 0.75 | bpc 1.07968 | |
| epoch 133 step 90820 | 268 batches | lr 0.000439 | ms/batch 5111.96 | loss 0.70 | bpc 1.00924 | |
| epoch 133 step 90830 | 278 batches | lr 0.000439 | ms/batch 5113.19 | loss 0.75 | bpc 1.08046 | |
| epoch 133 step 90840 | 288 batches | lr 0.000439 | ms/batch 5108.38 | loss 0.79 | bpc 1.13475 | |
| epoch 133 step 90850 | 298 batches | lr 0.000439 | ms/batch 5106.72 | loss 0.75 | bpc 1.08389 | |
| epoch 133 step 90860 | 308 batches | lr 0.000439 | ms/batch 5119.74 | loss 0.76 | bpc 1.09639 | |
| epoch 133 step 90870 | 318 batches | lr 0.000439 | ms/batch 5127.41 | loss 0.76 | bpc 1.09804 | |
| epoch 133 step 90880 | 328 batches | lr 0.000439 | ms/batch 5128.21 | loss 0.76 | bpc 1.09567 | |
| epoch 133 step 90890 | 338 batches | lr 0.000439 | ms/batch 5128.82 | loss 0.73 | bpc 1.05680 | |
| epoch 133 step 90900 | 348 batches | lr 0.000439 | ms/batch 5131.21 | loss 0.73 | bpc 1.04701 | |
| epoch 133 step 90910 | 358 batches | lr 0.000439 | ms/batch 5125.34 | loss 0.80 | bpc 1.14760 | |
| epoch 133 step 90920 | 368 batches | lr 0.000439 | ms/batch 5121.63 | loss 0.74 | bpc 1.07188 | |
| epoch 133 step 90930 | 378 batches | lr 0.000439 | ms/batch 5127.15 | loss 0.74 | bpc 1.06917 | |
| epoch 133 step 90940 | 388 batches | lr 0.000439 | ms/batch 5128.78 | loss 0.77 | bpc 1.11407 | |
| epoch 133 step 90950 | 398 batches | lr 0.000439 | ms/batch 5122.35 | loss 0.77 | bpc 1.10687 | |
| epoch 133 step 90960 | 408 batches | lr 0.000439 | ms/batch 5119.91 | loss 0.77 | bpc 1.11076 | |
| epoch 133 step 90970 | 418 batches | lr 0.000439 | ms/batch 5122.06 | loss 0.74 | bpc 1.06522 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
| epoch 133 step 90980 | 428 batches | lr 0.000439 | ms/batch 5119.29 | loss 0.73 | bpc 1.05165 | |
| epoch 133 step 90990 | 438 batches | lr 0.000439 | ms/batch 5119.67 | loss 0.77 | bpc 1.11531 | |
| epoch 133 step 91000 | 448 batches | lr 0.000439 | ms/batch 5120.20 | loss 0.77 | bpc 1.10507 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 364 at step 91000 | time: 1347.88s | valid loss 0.75 | bpc 1.07572 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 133 step 91010 | 458 batches | lr 0.000439 | ms/batch 11940.64 | loss 0.76 | bpc 1.10225 | |
| epoch 133 step 91020 | 468 batches | lr 0.000439 | ms/batch 5122.30 | loss 0.73 | bpc 1.04596 | |
| epoch 133 step 91030 | 478 batches | lr 0.000439 | ms/batch 5118.91 | loss 0.76 | bpc 1.09550 | |
| epoch 133 step 91040 | 488 batches | lr 0.000439 | ms/batch 5119.02 | loss 0.75 | bpc 1.08896 | |
| epoch 133 step 91050 | 498 batches | lr 0.000439 | ms/batch 5118.44 | loss 0.75 | bpc 1.07983 | |
| epoch 133 step 91060 | 508 batches | lr 0.000439 | ms/batch 5125.66 | loss 0.76 | bpc 1.08979 | |
| epoch 133 step 91070 | 518 batches | lr 0.000439 | ms/batch 5120.20 | loss 0.74 | bpc 1.07275 | |
| epoch 133 step 91080 | 528 batches | lr 0.000439 | ms/batch 5121.33 | loss 0.77 | bpc 1.11169 | |
| epoch 133 step 91090 | 538 batches | lr 0.000439 | ms/batch 5122.21 | loss 0.79 | bpc 1.14165 | |
| epoch 133 step 91100 | 548 batches | lr 0.000439 | ms/batch 5111.56 | loss 0.74 | bpc 1.06155 | |
| epoch 133 step 91110 | 558 batches | lr 0.000439 | ms/batch 5120.01 | loss 0.77 | bpc 1.10714 | |
| epoch 133 step 91120 | 568 batches | lr 0.000439 | ms/batch 5123.99 | loss 0.78 | bpc 1.12115 | |
| epoch 133 step 91130 | 578 batches | lr 0.000439 | ms/batch 5120.14 | loss 0.74 | bpc 1.06153 | |
| epoch 133 step 91140 | 588 batches | lr 0.000439 | ms/batch 5123.13 | loss 0.76 | bpc 1.09132 | |
| epoch 133 step 91150 | 598 batches | lr 0.000439 | ms/batch 5119.97 | loss 0.76 | bpc 1.09955 | |
| epoch 133 step 91160 | 608 batches | lr 0.000439 | ms/batch 5120.13 | loss 0.74 | bpc 1.06929 | |
| epoch 133 step 91170 | 618 batches | lr 0.000439 | ms/batch 5120.88 | loss 0.75 | bpc 1.08082 | |
| epoch 133 step 91180 | 628 batches | lr 0.000439 | ms/batch 5122.55 | loss 0.77 | bpc 1.10757 | |
| epoch 133 step 91190 | 638 batches | lr 0.000439 | ms/batch 5124.05 | loss 0.74 | bpc 1.07109 | |
| epoch 133 step 91200 | 648 batches | lr 0.000439 | ms/batch 5115.69 | loss 0.79 | bpc 1.14675 | |
| epoch 133 step 91210 | 658 batches | lr 0.000439 | ms/batch 5126.04 | loss 0.78 | bpc 1.12030 | |
| epoch 133 step 91220 | 668 batches | lr 0.000439 | ms/batch 5121.64 | loss 0.77 | bpc 1.11491 | |
| epoch 133 step 91230 | 678 batches | lr 0.000439 | ms/batch 5118.10 | loss 0.70 | bpc 1.00776 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 134 step 91240 | 2 batches | lr 0.000439 | ms/batch 5119.29 | loss 0.70 | bpc 1.00748 | |
| epoch 134 step 91250 | 12 batches | lr 0.000438 | ms/batch 5119.71 | loss 0.74 | bpc 1.07389 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 365 at step 91250 | time: 1364.88s | valid loss 0.74 | bpc 1.07403 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 134 step 91260 | 22 batches | lr 0.000438 | ms/batch 11939.70 | loss 0.77 | bpc 1.11210 | |
| epoch 134 step 91270 | 32 batches | lr 0.000438 | ms/batch 5125.48 | loss 0.74 | bpc 1.06260 | |
| epoch 134 step 91280 | 42 batches | lr 0.000438 | ms/batch 5123.30 | loss 0.72 | bpc 1.04584 | |
| epoch 134 step 91290 | 52 batches | lr 0.000438 | ms/batch 5121.10 | loss 0.76 | bpc 1.09389 | |
| epoch 134 step 91300 | 62 batches | lr 0.000438 | ms/batch 5124.09 | loss 0.76 | bpc 1.09346 | |
| epoch 134 step 91310 | 72 batches | lr 0.000438 | ms/batch 5120.03 | loss 0.76 | bpc 1.09210 | |
| epoch 134 step 91320 | 82 batches | lr 0.000438 | ms/batch 5122.85 | loss 0.70 | bpc 1.01647 | |
| epoch 134 step 91330 | 92 batches | lr 0.000438 | ms/batch 5122.64 | loss 0.72 | bpc 1.04324 | |
| epoch 134 step 91340 | 102 batches | lr 0.000438 | ms/batch 5121.83 | loss 0.74 | bpc 1.06071 | |
| epoch 134 step 91350 | 112 batches | lr 0.000438 | ms/batch 5120.23 | loss 0.75 | bpc 1.08647 | |
| epoch 134 step 91360 | 122 batches | lr 0.000438 | ms/batch 5120.56 | loss 0.75 | bpc 1.08814 | |
| epoch 134 step 91370 | 132 batches | lr 0.000438 | ms/batch 5125.54 | loss 0.76 | bpc 1.10039 | |
| epoch 134 step 91380 | 142 batches | lr 0.000438 | ms/batch 5122.66 | loss 0.80 | bpc 1.15846 | |
| epoch 134 step 91390 | 152 batches | lr 0.000438 | ms/batch 5125.40 | loss 0.76 | bpc 1.10032 | |
| epoch 134 step 91400 | 162 batches | lr 0.000438 | ms/batch 5122.63 | loss 0.79 | bpc 1.13451 | |
| epoch 134 step 91410 | 172 batches | lr 0.000438 | ms/batch 5121.27 | loss 0.71 | bpc 1.01871 | |
| epoch 134 step 91420 | 182 batches | lr 0.000438 | ms/batch 5117.23 | loss 0.68 | bpc 0.98026 | |
| epoch 134 step 91430 | 192 batches | lr 0.000438 | ms/batch 5122.47 | loss 0.69 | bpc 0.99124 | |
| epoch 134 step 91440 | 202 batches | lr 0.000438 | ms/batch 5123.32 | loss 0.61 | bpc 0.88409 | |
| epoch 134 step 91450 | 212 batches | lr 0.000438 | ms/batch 5126.42 | loss 0.73 | bpc 1.05204 | |
| epoch 134 step 91460 | 222 batches | lr 0.000438 | ms/batch 5120.11 | loss 0.76 | bpc 1.10027 | |
| epoch 134 step 91470 | 232 batches | lr 0.000438 | ms/batch 5119.94 | loss 0.74 | bpc 1.07314 | |
| epoch 134 step 91480 | 242 batches | lr 0.000438 | ms/batch 5120.24 | loss 0.74 | bpc 1.07302 | |
| epoch 134 step 91490 | 252 batches | lr 0.000438 | ms/batch 5125.06 | loss 0.79 | bpc 1.14510 | |
| epoch 134 step 91500 | 262 batches | lr 0.000438 | ms/batch 5120.84 | loss 0.72 | bpc 1.03439 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 366 at step 91500 | time: 1354.76s | valid loss 0.74 | bpc 1.07290 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 134 step 91510 | 272 batches | lr 0.000438 | ms/batch 11933.97 | loss 0.71 | bpc 1.02127 | |
| epoch 134 step 91520 | 282 batches | lr 0.000438 | ms/batch 5124.03 | loss 0.77 | bpc 1.10711 | |
| epoch 134 step 91530 | 292 batches | lr 0.000438 | ms/batch 5121.65 | loss 0.77 | bpc 1.10673 | |
| epoch 134 step 91540 | 302 batches | lr 0.000438 | ms/batch 5123.80 | loss 0.76 | bpc 1.09445 | |
| epoch 134 step 91550 | 312 batches | lr 0.000438 | ms/batch 5116.27 | loss 0.78 | bpc 1.12568 | |
| epoch 134 step 91560 | 322 batches | lr 0.000438 | ms/batch 5126.32 | loss 0.74 | bpc 1.06682 | |
| epoch 134 step 91570 | 332 batches | lr 0.000438 | ms/batch 5121.02 | loss 0.77 | bpc 1.11527 | |
| epoch 134 step 91580 | 342 batches | lr 0.000438 | ms/batch 5124.98 | loss 0.70 | bpc 1.01335 | |
| epoch 134 step 91590 | 352 batches | lr 0.000438 | ms/batch 5121.97 | loss 0.77 | bpc 1.10874 | |
| epoch 134 step 91600 | 362 batches | lr 0.000438 | ms/batch 5123.63 | loss 0.77 | bpc 1.11129 | |
| epoch 134 step 91610 | 372 batches | lr 0.000438 | ms/batch 5123.68 | loss 0.72 | bpc 1.04203 | |
| epoch 134 step 91620 | 382 batches | lr 0.000438 | ms/batch 5123.21 | loss 0.76 | bpc 1.09713 | |
| epoch 134 step 91630 | 392 batches | lr 0.000438 | ms/batch 5125.98 | loss 0.75 | bpc 1.08790 | |
| epoch 134 step 91640 | 402 batches | lr 0.000438 | ms/batch 5121.36 | loss 0.79 | bpc 1.14159 | |
| epoch 134 step 91650 | 412 batches | lr 0.000438 | ms/batch 5121.59 | loss 0.75 | bpc 1.08212 | |
| epoch 134 step 91660 | 422 batches | lr 0.000438 | ms/batch 5125.01 | loss 0.73 | bpc 1.05964 | |
| epoch 134 step 91670 | 432 batches | lr 0.000438 | ms/batch 5119.58 | loss 0.74 | bpc 1.07421 | |
| epoch 134 step 91680 | 442 batches | lr 0.000438 | ms/batch 5121.63 | loss 0.78 | bpc 1.12457 | |
| epoch 134 step 91690 | 452 batches | lr 0.000438 | ms/batch 5124.41 | loss 0.76 | bpc 1.09801 | |
| epoch 134 step 91700 | 462 batches | lr 0.000438 | ms/batch 5124.75 | loss 0.76 | bpc 1.09236 | |
| epoch 134 step 91710 | 472 batches | lr 0.000438 | ms/batch 5120.29 | loss 0.74 | bpc 1.07291 | |
| epoch 134 step 91720 | 482 batches | lr 0.000438 | ms/batch 5123.87 | loss 0.76 | bpc 1.09363 | |
| epoch 134 step 91730 | 492 batches | lr 0.000438 | ms/batch 5127.08 | loss 0.75 | bpc 1.07610 | |
| epoch 134 step 91740 | 502 batches | lr 0.000438 | ms/batch 5116.91 | loss 0.74 | bpc 1.07240 | |
| epoch 134 step 91750 | 512 batches | lr 0.000438 | ms/batch 5122.61 | loss 0.75 | bpc 1.07580 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 367 at step 91750 | time: 1348.93s | valid loss 0.75 | bpc 1.07552 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 134 step 91760 | 522 batches | lr 0.000438 | ms/batch 11937.75 | loss 0.76 | bpc 1.09140 | |
| epoch 134 step 91770 | 532 batches | lr 0.000438 | ms/batch 5124.05 | loss 0.79 | bpc 1.14535 | |
| epoch 134 step 91780 | 542 batches | lr 0.000438 | ms/batch 5125.27 | loss 0.75 | bpc 1.08582 | |
| epoch 134 step 91790 | 552 batches | lr 0.000438 | ms/batch 5122.81 | loss 0.76 | bpc 1.09162 | |
| epoch 134 step 91800 | 562 batches | lr 0.000438 | ms/batch 5118.09 | loss 0.78 | bpc 1.12371 | |
| epoch 134 step 91810 | 572 batches | lr 0.000438 | ms/batch 5118.24 | loss 0.75 | bpc 1.08403 | |
| epoch 134 step 91820 | 582 batches | lr 0.000438 | ms/batch 5124.38 | loss 0.73 | bpc 1.05600 | |
| epoch 134 step 91830 | 592 batches | lr 0.000438 | ms/batch 5119.11 | loss 0.78 | bpc 1.11963 | |
| epoch 134 step 91840 | 602 batches | lr 0.000438 | ms/batch 5126.73 | loss 0.74 | bpc 1.07304 | |
| epoch 134 step 91850 | 612 batches | lr 0.000438 | ms/batch 5120.16 | loss 0.73 | bpc 1.04737 | |
| epoch 134 step 91860 | 622 batches | lr 0.000438 | ms/batch 5120.81 | loss 0.77 | bpc 1.11122 | |
| epoch 134 step 91870 | 632 batches | lr 0.000438 | ms/batch 5120.20 | loss 0.77 | bpc 1.10538 | |
| epoch 134 step 91880 | 642 batches | lr 0.000438 | ms/batch 5120.47 | loss 0.76 | bpc 1.09339 | |
| epoch 134 step 91890 | 652 batches | lr 0.000438 | ms/batch 5121.04 | loss 0.79 | bpc 1.14010 | |
| epoch 134 step 91900 | 662 batches | lr 0.000438 | ms/batch 5122.39 | loss 0.78 | bpc 1.12818 | |
| epoch 134 step 91910 | 672 batches | lr 0.000438 | ms/batch 5121.94 | loss 0.72 | bpc 1.04372 | |
| epoch 134 step 91920 | 682 batches | lr 0.000438 | ms/batch 5122.30 | loss 0.71 | bpc 1.02469 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 135 step 91930 | 6 batches | lr 0.000438 | ms/batch 5119.01 | loss 0.71 | bpc 1.02902 | |
| epoch 135 step 91940 | 16 batches | lr 0.000438 | ms/batch 5126.68 | loss 0.75 | bpc 1.07787 | |
| epoch 135 step 91950 | 26 batches | lr 0.000438 | ms/batch 5123.15 | loss 0.78 | bpc 1.12506 | |
| epoch 135 step 91960 | 36 batches | lr 0.000438 | ms/batch 5121.75 | loss 0.71 | bpc 1.02348 | |
| epoch 135 step 91970 | 46 batches | lr 0.000438 | ms/batch 5119.93 | loss 0.74 | bpc 1.06674 | |
| epoch 135 step 91980 | 56 batches | lr 0.000438 | ms/batch 5115.76 | loss 0.77 | bpc 1.11783 | |
| epoch 135 step 91990 | 66 batches | lr 0.000438 | ms/batch 5120.60 | loss 0.76 | bpc 1.10003 | |
| epoch 135 step 92000 | 76 batches | lr 0.000438 | ms/batch 5127.21 | loss 0.72 | bpc 1.03862 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 368 at step 92000 | time: 1348.79s | valid loss 0.74 | bpc 1.07102 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 135 step 92010 | 86 batches | lr 0.000438 | ms/batch 11942.76 | loss 0.72 | bpc 1.03705 | |
| epoch 135 step 92020 | 96 batches | lr 0.000438 | ms/batch 5123.06 | loss 0.72 | bpc 1.03665 | |
| epoch 135 step 92030 | 106 batches | lr 0.000437 | ms/batch 5124.66 | loss 0.76 | bpc 1.08955 | |
| epoch 135 step 92040 | 116 batches | lr 0.000437 | ms/batch 5123.00 | loss 0.75 | bpc 1.07701 | |
| epoch 135 step 92050 | 126 batches | lr 0.000437 | ms/batch 5118.86 | loss 0.76 | bpc 1.09663 | |
| epoch 135 step 92060 | 136 batches | lr 0.000437 | ms/batch 5124.19 | loss 0.78 | bpc 1.12018 | |
| epoch 135 step 92070 | 146 batches | lr 0.000437 | ms/batch 5118.52 | loss 0.77 | bpc 1.11527 | |
| epoch 135 step 92080 | 156 batches | lr 0.000437 | ms/batch 5123.18 | loss 0.77 | bpc 1.11421 | |
| epoch 135 step 92090 | 166 batches | lr 0.000437 | ms/batch 5123.93 | loss 0.78 | bpc 1.12540 | |
| epoch 135 step 92100 | 176 batches | lr 0.000437 | ms/batch 5122.49 | loss 0.68 | bpc 0.98181 | |
| epoch 135 step 92110 | 186 batches | lr 0.000437 | ms/batch 5118.11 | loss 0.68 | bpc 0.98641 | |
| epoch 135 step 92120 | 196 batches | lr 0.000437 | ms/batch 5124.54 | loss 0.66 | bpc 0.95618 | |
| epoch 135 step 92130 | 206 batches | lr 0.000437 | ms/batch 5124.78 | loss 0.65 | bpc 0.94087 | |
| epoch 135 step 92140 | 216 batches | lr 0.000437 | ms/batch 5121.32 | loss 0.75 | bpc 1.08768 | |
| epoch 135 step 92150 | 226 batches | lr 0.000437 | ms/batch 5119.58 | loss 0.74 | bpc 1.06973 | |
| epoch 135 step 92160 | 236 batches | lr 0.000437 | ms/batch 5123.95 | loss 0.74 | bpc 1.06216 | |
| epoch 135 step 92170 | 246 batches | lr 0.000437 | ms/batch 5124.08 | loss 0.78 | bpc 1.12672 | |
| epoch 135 step 92180 | 256 batches | lr 0.000437 | ms/batch 5118.71 | loss 0.75 | bpc 1.08195 | |
| epoch 135 step 92190 | 266 batches | lr 0.000437 | ms/batch 5120.78 | loss 0.70 | bpc 1.01363 | |
| epoch 135 step 92200 | 276 batches | lr 0.000437 | ms/batch 5120.06 | loss 0.74 | bpc 1.07181 | |
| epoch 135 step 92210 | 286 batches | lr 0.000437 | ms/batch 5126.28 | loss 0.79 | bpc 1.13477 | |
| epoch 135 step 92220 | 296 batches | lr 0.000437 | ms/batch 5122.35 | loss 0.75 | bpc 1.07681 | |
| epoch 135 step 92230 | 306 batches | lr 0.000437 | ms/batch 5123.06 | loss 0.75 | bpc 1.08436 | |
| epoch 135 step 92240 | 316 batches | lr 0.000437 | ms/batch 5119.68 | loss 0.78 | bpc 1.11880 | |
| epoch 135 step 92250 | 326 batches | lr 0.000437 | ms/batch 5122.97 | loss 0.75 | bpc 1.08105 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 369 at step 92250 | time: 1348.85s | valid loss 0.74 | bpc 1.07153 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 135 step 92260 | 336 batches | lr 0.000437 | ms/batch 11938.75 | loss 0.75 | bpc 1.07953 | |
| epoch 135 step 92270 | 346 batches | lr 0.000437 | ms/batch 5120.34 | loss 0.72 | bpc 1.03947 | |
| epoch 135 step 92280 | 356 batches | lr 0.000437 | ms/batch 5120.88 | loss 0.78 | bpc 1.12837 | |
| epoch 135 step 92290 | 366 batches | lr 0.000437 | ms/batch 5117.57 | loss 0.74 | bpc 1.06870 | |
| epoch 135 step 92300 | 376 batches | lr 0.000437 | ms/batch 5120.35 | loss 0.73 | bpc 1.05865 | |
| epoch 135 step 92310 | 386 batches | lr 0.000437 | ms/batch 5117.74 | loss 0.77 | bpc 1.11322 | |
| epoch 135 step 92320 | 396 batches | lr 0.000437 | ms/batch 5119.63 | loss 0.76 | bpc 1.10073 | |
| epoch 135 step 92330 | 406 batches | lr 0.000437 | ms/batch 5126.25 | loss 0.77 | bpc 1.11209 | |
| epoch 135 step 92340 | 416 batches | lr 0.000437 | ms/batch 5121.94 | loss 0.75 | bpc 1.08439 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
| epoch 135 step 92350 | 426 batches | lr 0.000437 | ms/batch 5123.37 | loss 0.72 | bpc 1.03635 | |
| epoch 135 step 92360 | 436 batches | lr 0.000437 | ms/batch 5121.58 | loss 0.78 | bpc 1.12885 | |
| epoch 135 step 92370 | 446 batches | lr 0.000437 | ms/batch 5119.95 | loss 0.76 | bpc 1.09494 | |
| epoch 135 step 92380 | 456 batches | lr 0.000437 | ms/batch 5122.46 | loss 0.76 | bpc 1.09319 | |
| epoch 135 step 92390 | 466 batches | lr 0.000437 | ms/batch 5121.26 | loss 0.75 | bpc 1.08044 | |
| epoch 135 step 92400 | 476 batches | lr 0.000437 | ms/batch 5120.14 | loss 0.74 | bpc 1.06922 | |
| epoch 135 step 92410 | 486 batches | lr 0.000437 | ms/batch 5125.21 | loss 0.76 | bpc 1.09745 | |
| epoch 135 step 92420 | 496 batches | lr 0.000437 | ms/batch 5118.92 | loss 0.75 | bpc 1.08748 | |
| epoch 135 step 92430 | 506 batches | lr 0.000437 | ms/batch 5115.12 | loss 0.74 | bpc 1.06351 | |
| epoch 135 step 92440 | 516 batches | lr 0.000437 | ms/batch 5125.97 | loss 0.75 | bpc 1.08489 | |
| epoch 135 step 92450 | 526 batches | lr 0.000437 | ms/batch 5120.37 | loss 0.76 | bpc 1.10136 | |
| epoch 135 step 92460 | 536 batches | lr 0.000437 | ms/batch 5123.31 | loss 0.79 | bpc 1.13906 | |
| epoch 135 step 92470 | 546 batches | lr 0.000437 | ms/batch 5124.34 | loss 0.74 | bpc 1.07216 | |
| epoch 135 step 92480 | 556 batches | lr 0.000437 | ms/batch 5119.22 | loss 0.77 | bpc 1.10954 | |
| epoch 135 step 92490 | 566 batches | lr 0.000437 | ms/batch 5123.43 | loss 0.78 | bpc 1.12285 | |
| epoch 135 step 92500 | 576 batches | lr 0.000437 | ms/batch 5123.83 | loss 0.74 | bpc 1.06467 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 370 at step 92500 | time: 1348.64s | valid loss 0.74 | bpc 1.07452 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 135 step 92510 | 586 batches | lr 0.000437 | ms/batch 11942.68 | loss 0.74 | bpc 1.07290 | |
| epoch 135 step 92520 | 596 batches | lr 0.000437 | ms/batch 5123.11 | loss 0.77 | bpc 1.11339 | |
| epoch 135 step 92530 | 606 batches | lr 0.000437 | ms/batch 5124.56 | loss 0.74 | bpc 1.06615 | |
| epoch 135 step 92540 | 616 batches | lr 0.000437 | ms/batch 5123.49 | loss 0.74 | bpc 1.06559 | |
| epoch 135 step 92550 | 626 batches | lr 0.000437 | ms/batch 5124.33 | loss 0.76 | bpc 1.09354 | |
| epoch 135 step 92560 | 636 batches | lr 0.000437 | ms/batch 5113.53 | loss 0.77 | bpc 1.11025 | |
| epoch 135 step 92570 | 646 batches | lr 0.000437 | ms/batch 5123.59 | loss 0.76 | bpc 1.10032 | |
| epoch 135 step 92580 | 656 batches | lr 0.000437 | ms/batch 5125.44 | loss 0.79 | bpc 1.13868 | |
| epoch 135 step 92590 | 666 batches | lr 0.000437 | ms/batch 5120.03 | loss 0.78 | bpc 1.12575 | |
| epoch 135 step 92600 | 676 batches | lr 0.000437 | ms/batch 5122.01 | loss 0.70 | bpc 1.01207 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
| epoch 135 step 92610 | 686 batches | lr 0.000437 | ms/batch 5122.36 | loss 0.70 | bpc 1.00351 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 136 step 92620 | 10 batches | lr 0.000437 | ms/batch 5122.12 | loss 0.73 | bpc 1.05398 | |
| epoch 136 step 92630 | 20 batches | lr 0.000437 | ms/batch 5122.74 | loss 0.76 | bpc 1.09805 | |
| epoch 136 step 92640 | 30 batches | lr 0.000437 | ms/batch 5119.12 | loss 0.76 | bpc 1.08943 | |
| epoch 136 step 92650 | 40 batches | lr 0.000437 | ms/batch 5126.25 | loss 0.72 | bpc 1.03534 | |
| epoch 136 step 92660 | 50 batches | lr 0.000437 | ms/batch 5118.66 | loss 0.76 | bpc 1.09752 | |
| epoch 136 step 92670 | 60 batches | lr 0.000437 | ms/batch 5123.61 | loss 0.77 | bpc 1.11096 | |
| epoch 136 step 92680 | 70 batches | lr 0.000437 | ms/batch 5120.75 | loss 0.75 | bpc 1.07573 | |
| epoch 136 step 92690 | 80 batches | lr 0.000437 | ms/batch 5124.45 | loss 0.71 | bpc 1.02107 | |
| epoch 136 step 92700 | 90 batches | lr 0.000437 | ms/batch 5121.95 | loss 0.74 | bpc 1.06097 | |
| epoch 136 step 92710 | 100 batches | lr 0.000437 | ms/batch 5123.19 | loss 0.71 | bpc 1.02251 | |
| epoch 136 step 92720 | 110 batches | lr 0.000437 | ms/batch 5113.39 | loss 0.77 | bpc 1.10710 | |
| epoch 136 step 92730 | 120 batches | lr 0.000437 | ms/batch 5123.65 | loss 0.74 | bpc 1.06208 | |
| epoch 136 step 92740 | 130 batches | lr 0.000437 | ms/batch 5125.91 | loss 0.78 | bpc 1.12550 | |
| epoch 136 step 92750 | 140 batches | lr 0.000437 | ms/batch 5127.09 | loss 0.80 | bpc 1.14714 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 371 at step 92750 | time: 1349.11s | valid loss 0.74 | bpc 1.07145 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 136 step 92760 | 150 batches | lr 0.000437 | ms/batch 11945.82 | loss 0.75 | bpc 1.08482 | |
| epoch 136 step 92770 | 160 batches | lr 0.000437 | ms/batch 5123.42 | loss 0.79 | bpc 1.13710 | |
| epoch 136 step 92780 | 170 batches | lr 0.000437 | ms/batch 5121.72 | loss 0.73 | bpc 1.05799 | |
| epoch 136 step 92790 | 180 batches | lr 0.000436 | ms/batch 5123.36 | loss 0.68 | bpc 0.97656 | |
| epoch 136 step 92800 | 190 batches | lr 0.000436 | ms/batch 5124.59 | loss 0.68 | bpc 0.97435 | |
| epoch 136 step 92810 | 200 batches | lr 0.000436 | ms/batch 5127.55 | loss 0.63 | bpc 0.91498 | |
| epoch 136 step 92820 | 210 batches | lr 0.000436 | ms/batch 5126.69 | loss 0.70 | bpc 1.01295 | |
| epoch 136 step 92830 | 220 batches | lr 0.000436 | ms/batch 5123.19 | loss 0.77 | bpc 1.10973 | |
| epoch 136 step 92840 | 230 batches | lr 0.000436 | ms/batch 5124.17 | loss 0.74 | bpc 1.07001 | |
| epoch 136 step 92850 | 240 batches | lr 0.000436 | ms/batch 5125.55 | loss 0.74 | bpc 1.07058 | |
| epoch 136 step 92860 | 250 batches | lr 0.000436 | ms/batch 5121.88 | loss 0.79 | bpc 1.13413 | |
| epoch 136 step 92870 | 260 batches | lr 0.000436 | ms/batch 5121.36 | loss 0.73 | bpc 1.04599 | |
| epoch 136 step 92880 | 270 batches | lr 0.000436 | ms/batch 5113.14 | loss 0.71 | bpc 1.02279 | |
| epoch 136 step 92890 | 280 batches | lr 0.000436 | ms/batch 5129.46 | loss 0.75 | bpc 1.08839 | |
| epoch 136 step 92900 | 290 batches | lr 0.000436 | ms/batch 5126.24 | loss 0.78 | bpc 1.12098 | |
| epoch 136 step 92910 | 300 batches | lr 0.000436 | ms/batch 5123.44 | loss 0.75 | bpc 1.08574 | |
| epoch 136 step 92920 | 310 batches | lr 0.000436 | ms/batch 5122.10 | loss 0.77 | bpc 1.11493 | |
| epoch 136 step 92930 | 320 batches | lr 0.000436 | ms/batch 5123.49 | loss 0.74 | bpc 1.06234 | |
| epoch 136 step 92940 | 330 batches | lr 0.000436 | ms/batch 5120.29 | loss 0.76 | bpc 1.10155 | |
| epoch 136 step 92950 | 340 batches | lr 0.000436 | ms/batch 5124.48 | loss 0.73 | bpc 1.04707 | |
| epoch 136 step 92960 | 350 batches | lr 0.000436 | ms/batch 5127.02 | loss 0.74 | bpc 1.07461 | |
| epoch 136 step 92970 | 360 batches | lr 0.000436 | ms/batch 5123.65 | loss 0.79 | bpc 1.13808 | |
| epoch 136 step 92980 | 370 batches | lr 0.000436 | ms/batch 5120.34 | loss 0.73 | bpc 1.05712 | |
| epoch 136 step 92990 | 380 batches | lr 0.000436 | ms/batch 5128.76 | loss 0.75 | bpc 1.08118 | |
| epoch 136 step 93000 | 390 batches | lr 0.000436 | ms/batch 5121.16 | loss 0.75 | bpc 1.07838 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 372 at step 93000 | time: 1352.88s | valid loss 0.74 | bpc 1.07187 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 136 step 93010 | 400 batches | lr 0.000436 | ms/batch 11945.09 | loss 0.79 | bpc 1.13832 | |
| epoch 136 step 93020 | 410 batches | lr 0.000436 | ms/batch 5125.38 | loss 0.76 | bpc 1.09491 | |
| epoch 136 step 93030 | 420 batches | lr 0.000436 | ms/batch 5125.95 | loss 0.73 | bpc 1.05675 | |
| epoch 136 step 93040 | 430 batches | lr 0.000436 | ms/batch 5123.19 | loss 0.74 | bpc 1.06510 | |
| epoch 136 step 93050 | 440 batches | lr 0.000436 | ms/batch 5114.47 | loss 0.77 | bpc 1.11440 | |
| epoch 136 step 93060 | 450 batches | lr 0.000436 | ms/batch 5118.84 | loss 0.76 | bpc 1.09957 | |
| epoch 136 step 93070 | 460 batches | lr 0.000436 | ms/batch 5127.15 | loss 0.76 | bpc 1.09381 | |
| epoch 136 step 93080 | 470 batches | lr 0.000436 | ms/batch 5125.75 | loss 0.73 | bpc 1.05676 | |
| epoch 136 step 93090 | 480 batches | lr 0.000436 | ms/batch 5124.33 | loss 0.77 | bpc 1.10690 | |
| epoch 136 step 93100 | 490 batches | lr 0.000436 | ms/batch 5126.13 | loss 0.74 | bpc 1.07454 | |
| epoch 136 step 93110 | 500 batches | lr 0.000436 | ms/batch 5124.39 | loss 0.74 | bpc 1.06879 | |
| epoch 136 step 93120 | 510 batches | lr 0.000436 | ms/batch 5120.59 | loss 0.75 | bpc 1.07689 | |
| epoch 136 step 93130 | 520 batches | lr 0.000436 | ms/batch 5120.14 | loss 0.76 | bpc 1.09218 | |
| epoch 136 step 93140 | 530 batches | lr 0.000436 | ms/batch 5119.96 | loss 0.77 | bpc 1.11468 | |
| epoch 136 step 93150 | 540 batches | lr 0.000436 | ms/batch 5123.98 | loss 0.78 | bpc 1.12716 | |
| epoch 136 step 93160 | 550 batches | lr 0.000436 | ms/batch 5129.19 | loss 0.74 | bpc 1.06615 | |
| epoch 136 step 93170 | 560 batches | lr 0.000436 | ms/batch 5126.45 | loss 0.77 | bpc 1.11445 | |
| epoch 136 step 93180 | 570 batches | lr 0.000436 | ms/batch 5120.76 | loss 0.76 | bpc 1.09928 | |
| epoch 136 step 93190 | 580 batches | lr 0.000436 | ms/batch 5119.55 | loss 0.74 | bpc 1.07053 | |
| epoch 136 step 93200 | 590 batches | lr 0.000436 | ms/batch 5118.19 | loss 0.75 | bpc 1.08633 | |
| epoch 136 step 93210 | 600 batches | lr 0.000436 | ms/batch 5120.93 | loss 0.76 | bpc 1.08994 | |
| epoch 136 step 93220 | 610 batches | lr 0.000436 | ms/batch 5121.75 | loss 0.73 | bpc 1.05895 | |
| epoch 136 step 93230 | 620 batches | lr 0.000436 | ms/batch 5127.13 | loss 0.76 | bpc 1.09280 | |
| epoch 136 step 93240 | 630 batches | lr 0.000436 | ms/batch 5120.74 | loss 0.77 | bpc 1.10741 | |
| epoch 136 step 93250 | 640 batches | lr 0.000436 | ms/batch 5122.27 | loss 0.75 | bpc 1.08669 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 373 at step 93250 | time: 1348.88s | valid loss 0.74 | bpc 1.07242 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 136 step 93260 | 650 batches | lr 0.000436 | ms/batch 11927.68 | loss 0.79 | bpc 1.14331 | |
| epoch 136 step 93270 | 660 batches | lr 0.000436 | ms/batch 5171.41 | loss 0.77 | bpc 1.11696 | |
| epoch 136 step 93280 | 670 batches | lr 0.000436 | ms/batch 5193.36 | loss 0.75 | bpc 1.08359 | |
| epoch 136 step 93290 | 680 batches | lr 0.000436 | ms/batch 5184.30 | loss 0.70 | bpc 1.01391 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
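# The "dropping batch" lines show the data pipeline discarding any chunk that
# cannot fill the full context window: batches are [seq_len, batch_size]
# tensors (seq_len=16384, batch_size=2 here), and the tail of each stream is
# shorter than seq_len, so it is skipped rather than padded. The recurring
# [2407, 2] tail precedes every eval (validation stream); the ~[10575, 2]
# tails appear at each epoch boundary (training stream). A minimal sketch of
# such a filter, assuming a `batches` iterable of tensors:
def full_batches(batches, seq_len=16384):
    for batch in batches:
        if batch.size(0) < seq_len:
            print(f"dropping batch of size {batch.size()} "
                  f"because not full seq_len of {seq_len}")
            continue
        yield batch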
| epoch 137 step 93300 | 4 batches | lr 0.000436 | ms/batch 5132.94 | loss 0.70 | bpc 1.00973 | |
| epoch 137 step 93310 | 14 batches | lr 0.000436 | ms/batch 5125.01 | loss 0.74 | bpc 1.07193 | |
| epoch 137 step 93320 | 24 batches | lr 0.000436 | ms/batch 5121.24 | loss 0.77 | bpc 1.11415 | |
| epoch 137 step 93330 | 34 batches | lr 0.000436 | ms/batch 5128.98 | loss 0.73 | bpc 1.05320 | |
| epoch 137 step 93340 | 44 batches | lr 0.000436 | ms/batch 5169.73 | loss 0.73 | bpc 1.04993 | |
| epoch 137 step 93350 | 54 batches | lr 0.000436 | ms/batch 5131.65 | loss 0.77 | bpc 1.11105 | |
| epoch 137 step 93360 | 64 batches | lr 0.000436 | ms/batch 5150.06 | loss 0.75 | bpc 1.08663 | |
| epoch 137 step 93370 | 74 batches | lr 0.000436 | ms/batch 5162.11 | loss 0.74 | bpc 1.06857 | |
| epoch 137 step 93380 | 84 batches | lr 0.000436 | ms/batch 5170.18 | loss 0.70 | bpc 1.01583 | |
| epoch 137 step 93390 | 94 batches | lr 0.000436 | ms/batch 5212.89 | loss 0.72 | bpc 1.04192 | |
| epoch 137 step 93400 | 104 batches | lr 0.000436 | ms/batch 5223.41 | loss 0.75 | bpc 1.07514 | |
| epoch 137 step 93410 | 114 batches | lr 0.000436 | ms/batch 5175.11 | loss 0.75 | bpc 1.07902 | |
| epoch 137 step 93420 | 124 batches | lr 0.000436 | ms/batch 5204.32 | loss 0.76 | bpc 1.09765 | |
| epoch 137 step 93430 | 134 batches | lr 0.000436 | ms/batch 5218.20 | loss 0.76 | bpc 1.09961 | |
| epoch 137 step 93440 | 144 batches | lr 0.000436 | ms/batch 5243.48 | loss 0.79 | bpc 1.13649 | |
| epoch 137 step 93450 | 154 batches | lr 0.000436 | ms/batch 5164.13 | loss 0.77 | bpc 1.10429 | |
| epoch 137 step 93460 | 164 batches | lr 0.000436 | ms/batch 5217.47 | loss 0.79 | bpc 1.13742 | |
| epoch 137 step 93470 | 174 batches | lr 0.000436 | ms/batch 5210.08 | loss 0.68 | bpc 0.98510 | |
| epoch 137 step 93480 | 184 batches | lr 0.000436 | ms/batch 5241.20 | loss 0.68 | bpc 0.98751 | |
| epoch 137 step 93490 | 194 batches | lr 0.000436 | ms/batch 5196.69 | loss 0.69 | bpc 0.99812 | |
| epoch 137 step 93500 | 204 batches | lr 0.000436 | ms/batch 5212.62 | loss 0.62 | bpc 0.89872 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 374 at step 93500 | time: 1364.28s | valid loss 0.74 | bpc 1.06755 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 137 step 93510 | 214 batches | lr 0.000436 | ms/batch 12621.16 | loss 0.73 | bpc 1.05456 | |
| epoch 137 step 93520 | 224 batches | lr 0.000436 | ms/batch 5139.73 | loss 0.75 | bpc 1.07546 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
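# "Gradient overflow" lines are dynamic loss scaling at work in mixed-precision
# training (the message format matches NVIDIA Apex's AMP loss scaler). When a
# scaled fp16 gradient hits inf/nan, the optimizer step is skipped and the loss
# scale is halved (here 2**20 = 1048576 -> 2**19 = 524288); after a run of
# clean steps the scale is grown again. A minimal sketch of that policy, not
# Apex's actual implementation (class name and growth window are illustrative):
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 20, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should be taken."""
        if found_overflow:
            self.scale /= 2  # back off, e.g. 1048576.0 -> 524288.0
            self._good_steps = 0
            return False     # skip this step
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= 2  # periodically probe a larger scale
        return True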
| epoch 137 step 93530 | 234 batches | lr 0.000436 | ms/batch 5115.56 | loss 0.74 | bpc 1.06832 | |
| epoch 137 step 93540 | 244 batches | lr 0.000436 | ms/batch 5118.86 | loss 0.76 | bpc 1.09369 | |
| epoch 137 step 93550 | 254 batches | lr 0.000436 | ms/batch 5164.09 | loss 0.78 | bpc 1.12919 | |
| epoch 137 step 93560 | 264 batches | lr 0.000435 | ms/batch 5186.32 | loss 0.70 | bpc 1.01676 | |
| epoch 137 step 93570 | 274 batches | lr 0.000435 | ms/batch 5165.77 | loss 0.72 | bpc 1.04330 | |
| epoch 137 step 93580 | 284 batches | lr 0.000435 | ms/batch 5201.85 | loss 0.77 | bpc 1.11192 | |
| epoch 137 step 93590 | 294 batches | lr 0.000435 | ms/batch 5195.54 | loss 0.76 | bpc 1.09840 | |
| epoch 137 step 93600 | 304 batches | lr 0.000435 | ms/batch 5192.47 | loss 0.76 | bpc 1.09505 | |
| epoch 137 step 93610 | 314 batches | lr 0.000435 | ms/batch 5221.33 | loss 0.77 | bpc 1.10992 | |
| epoch 137 step 93620 | 324 batches | lr 0.000435 | ms/batch 5203.58 | loss 0.75 | bpc 1.08079 | |
| epoch 137 step 93630 | 334 batches | lr 0.000435 | ms/batch 5168.22 | loss 0.76 | bpc 1.09462 | |
| epoch 137 step 93640 | 344 batches | lr 0.000435 | ms/batch 5183.99 | loss 0.71 | bpc 1.02580 | |
| epoch 137 step 93650 | 354 batches | lr 0.000435 | ms/batch 5217.09 | loss 0.78 | bpc 1.12581 | |
| epoch 137 step 93660 | 364 batches | lr 0.000435 | ms/batch 5200.25 | loss 0.75 | bpc 1.07971 | |
| epoch 137 step 93670 | 374 batches | lr 0.000435 | ms/batch 5210.11 | loss 0.72 | bpc 1.04414 | |
| epoch 137 step 93680 | 384 batches | lr 0.000435 | ms/batch 5203.70 | loss 0.76 | bpc 1.10189 | |
| epoch 137 step 93690 | 394 batches | lr 0.000435 | ms/batch 5180.42 | loss 0.76 | bpc 1.09135 | |
| epoch 137 step 93700 | 404 batches | lr 0.000435 | ms/batch 5176.68 | loss 0.77 | bpc 1.11452 | |
| epoch 137 step 93710 | 414 batches | lr 0.000435 | ms/batch 5205.28 | loss 0.76 | bpc 1.09466 | |
| epoch 137 step 93720 | 424 batches | lr 0.000435 | ms/batch 5187.01 | loss 0.72 | bpc 1.04432 | |
| epoch 137 step 93730 | 434 batches | lr 0.000435 | ms/batch 5197.36 | loss 0.76 | bpc 1.10224 | |
| epoch 137 step 93740 | 444 batches | lr 0.000435 | ms/batch 5218.37 | loss 0.76 | bpc 1.10300 | |
| epoch 137 step 93750 | 454 batches | lr 0.000435 | ms/batch 5212.56 | loss 0.77 | bpc 1.11054 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 375 at step 93750 | time: 1371.10s | valid loss 0.75 | bpc 1.07559 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 137 step 93760 | 464 batches | lr 0.000435 | ms/batch 12001.39 | loss 0.75 | bpc 1.08236 | |
| epoch 137 step 93770 | 474 batches | lr 0.000435 | ms/batch 5183.62 | loss 0.74 | bpc 1.06372 | |
| epoch 137 step 93780 | 484 batches | lr 0.000435 | ms/batch 5154.18 | loss 0.76 | bpc 1.09129 | |
| epoch 137 step 93790 | 494 batches | lr 0.000435 | ms/batch 5159.11 | loss 0.76 | bpc 1.09560 | |
| epoch 137 step 93800 | 504 batches | lr 0.000435 | ms/batch 5152.80 | loss 0.74 | bpc 1.06099 | |
| epoch 137 step 93810 | 514 batches | lr 0.000435 | ms/batch 5153.74 | loss 0.74 | bpc 1.07025 | |
| epoch 137 step 93820 | 524 batches | lr 0.000435 | ms/batch 5160.83 | loss 0.75 | bpc 1.08716 | |
| epoch 137 step 93830 | 534 batches | lr 0.000435 | ms/batch 5153.33 | loss 0.80 | bpc 1.15232 | |
| epoch 137 step 93840 | 544 batches | lr 0.000435 | ms/batch 5155.31 | loss 0.74 | bpc 1.07454 | |
| epoch 137 step 93850 | 554 batches | lr 0.000435 | ms/batch 5169.34 | loss 0.76 | bpc 1.10187 | |
| epoch 137 step 93860 | 564 batches | lr 0.000435 | ms/batch 5158.10 | loss 0.77 | bpc 1.11545 | |
| epoch 137 step 93870 | 574 batches | lr 0.000435 | ms/batch 5158.42 | loss 0.75 | bpc 1.07490 | |
| epoch 137 step 93880 | 584 batches | lr 0.000435 | ms/batch 5153.78 | loss 0.74 | bpc 1.06443 | |
| epoch 137 step 93890 | 594 batches | lr 0.000435 | ms/batch 5156.90 | loss 0.77 | bpc 1.10882 | |
| epoch 137 step 93900 | 604 batches | lr 0.000435 | ms/batch 5151.98 | loss 0.75 | bpc 1.08118 | |
| epoch 137 step 93910 | 614 batches | lr 0.000435 | ms/batch 5135.47 | loss 0.73 | bpc 1.04872 | |
| epoch 137 step 93920 | 624 batches | lr 0.000435 | ms/batch 5160.83 | loss 0.76 | bpc 1.09569 | |
| epoch 137 step 93930 | 634 batches | lr 0.000435 | ms/batch 5172.54 | loss 0.77 | bpc 1.11162 | |
| epoch 137 step 93940 | 644 batches | lr 0.000435 | ms/batch 5147.93 | loss 0.75 | bpc 1.08028 | |
| epoch 137 step 93950 | 654 batches | lr 0.000435 | ms/batch 5160.18 | loss 0.80 | bpc 1.15807 | |
| epoch 137 step 93960 | 664 batches | lr 0.000435 | ms/batch 5174.04 | loss 0.78 | bpc 1.11910 | |
| epoch 137 step 93970 | 674 batches | lr 0.000435 | ms/batch 5170.65 | loss 0.71 | bpc 1.02453 | |
| epoch 137 step 93980 | 684 batches | lr 0.000435 | ms/batch 5188.26 | loss 0.70 | bpc 1.01341 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 138 step 93990 | 8 batches | lr 0.000435 | ms/batch 5182.10 | loss 0.72 | bpc 1.04069 | |
| epoch 138 step 94000 | 18 batches | lr 0.000435 | ms/batch 5168.00 | loss 0.75 | bpc 1.08499 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 376 at step 94000 | time: 1362.85s | valid loss 0.75 | bpc 1.07588 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 138 step 94010 | 28 batches | lr 0.000435 | ms/batch 12006.86 | loss 0.77 | bpc 1.10562 | |
| epoch 138 step 94020 | 38 batches | lr 0.000435 | ms/batch 5185.89 | loss 0.72 | bpc 1.03517 | |
| epoch 138 step 94030 | 48 batches | lr 0.000435 | ms/batch 5174.27 | loss 0.74 | bpc 1.06706 | |
| epoch 138 step 94040 | 58 batches | lr 0.000435 | ms/batch 5186.87 | loss 0.77 | bpc 1.11073 | |
| epoch 138 step 94050 | 68 batches | lr 0.000435 | ms/batch 5191.17 | loss 0.76 | bpc 1.09410 | |
| epoch 138 step 94060 | 78 batches | lr 0.000435 | ms/batch 5166.96 | loss 0.71 | bpc 1.02675 | |
| epoch 138 step 94070 | 88 batches | lr 0.000435 | ms/batch 5183.24 | loss 0.72 | bpc 1.04319 | |
| epoch 138 step 94080 | 98 batches | lr 0.000435 | ms/batch 5174.32 | loss 0.71 | bpc 1.02535 | |
| epoch 138 step 94090 | 108 batches | lr 0.000435 | ms/batch 5166.65 | loss 0.77 | bpc 1.10680 | |
| epoch 138 step 94100 | 118 batches | lr 0.000435 | ms/batch 5169.01 | loss 0.74 | bpc 1.06946 | |
| epoch 138 step 94110 | 128 batches | lr 0.000435 | ms/batch 5172.20 | loss 0.77 | bpc 1.11680 | |
| epoch 138 step 94120 | 138 batches | lr 0.000435 | ms/batch 5166.17 | loss 0.78 | bpc 1.12942 | |
| epoch 138 step 94130 | 148 batches | lr 0.000435 | ms/batch 5162.64 | loss 0.75 | bpc 1.07788 | |
| epoch 138 step 94140 | 158 batches | lr 0.000435 | ms/batch 5168.57 | loss 0.79 | bpc 1.13732 | |
| epoch 138 step 94150 | 168 batches | lr 0.000435 | ms/batch 5206.59 | loss 0.75 | bpc 1.08776 | |
| epoch 138 step 94160 | 178 batches | lr 0.000435 | ms/batch 5204.62 | loss 0.67 | bpc 0.97335 | |
| epoch 138 step 94170 | 188 batches | lr 0.000435 | ms/batch 5205.50 | loss 0.68 | bpc 0.97914 | |
| epoch 138 step 94180 | 198 batches | lr 0.000435 | ms/batch 5177.71 | loss 0.65 | bpc 0.93954 | |
| epoch 138 step 94190 | 208 batches | lr 0.000435 | ms/batch 5177.90 | loss 0.67 | bpc 0.97060 | |
| epoch 138 step 94200 | 218 batches | lr 0.000435 | ms/batch 5160.02 | loss 0.76 | bpc 1.09306 | |
| epoch 138 step 94210 | 228 batches | lr 0.000435 | ms/batch 5179.14 | loss 0.74 | bpc 1.07340 | |
| epoch 138 step 94220 | 238 batches | lr 0.000435 | ms/batch 5178.22 | loss 0.74 | bpc 1.06503 | |
| epoch 138 step 94230 | 248 batches | lr 0.000435 | ms/batch 5177.63 | loss 0.78 | bpc 1.12385 | |
| epoch 138 step 94240 | 258 batches | lr 0.000435 | ms/batch 5170.73 | loss 0.75 | bpc 1.07925 | |
| epoch 138 step 94250 | 268 batches | lr 0.000435 | ms/batch 5181.85 | loss 0.70 | bpc 1.00831 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 377 at step 94250 | time: 1363.07s | valid loss 0.74 | bpc 1.06965 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 138 step 94260 | 278 batches | lr 0.000435 | ms/batch 11947.23 | loss 0.75 | bpc 1.07843 | |
| epoch 138 step 94270 | 288 batches | lr 0.000435 | ms/batch 5175.94 | loss 0.78 | bpc 1.13199 | |
| epoch 138 step 94280 | 298 batches | lr 0.000435 | ms/batch 5182.28 | loss 0.75 | bpc 1.07935 | |
| epoch 138 step 94290 | 308 batches | lr 0.000435 | ms/batch 5172.44 | loss 0.76 | bpc 1.09519 | |
| epoch 138 step 94300 | 318 batches | lr 0.000435 | ms/batch 5176.76 | loss 0.76 | bpc 1.09535 | |
| epoch 138 step 94310 | 328 batches | lr 0.000434 | ms/batch 5156.50 | loss 0.76 | bpc 1.09502 | |
| epoch 138 step 94320 | 338 batches | lr 0.000434 | ms/batch 5173.13 | loss 0.73 | bpc 1.05426 | |
| epoch 138 step 94330 | 348 batches | lr 0.000434 | ms/batch 5174.36 | loss 0.72 | bpc 1.04226 | |
| epoch 138 step 94340 | 358 batches | lr 0.000434 | ms/batch 5171.27 | loss 0.79 | bpc 1.14506 | |
| epoch 138 step 94350 | 368 batches | lr 0.000434 | ms/batch 5163.74 | loss 0.74 | bpc 1.06686 | |
| epoch 138 step 94360 | 378 batches | lr 0.000434 | ms/batch 5180.91 | loss 0.74 | bpc 1.06722 | |
| epoch 138 step 94370 | 388 batches | lr 0.000434 | ms/batch 5161.36 | loss 0.77 | bpc 1.11238 | |
| epoch 138 step 94380 | 398 batches | lr 0.000434 | ms/batch 5157.35 | loss 0.77 | bpc 1.10399 | |
| epoch 138 step 94390 | 408 batches | lr 0.000434 | ms/batch 5166.62 | loss 0.77 | bpc 1.10810 | |
| epoch 138 step 94400 | 418 batches | lr 0.000434 | ms/batch 5167.67 | loss 0.74 | bpc 1.06112 | |
| epoch 138 step 94410 | 428 batches | lr 0.000434 | ms/batch 5172.25 | loss 0.73 | bpc 1.05076 | |
| epoch 138 step 94420 | 438 batches | lr 0.000434 | ms/batch 5179.27 | loss 0.77 | bpc 1.11392 | |
| epoch 138 step 94430 | 448 batches | lr 0.000434 | ms/batch 5175.88 | loss 0.76 | bpc 1.10206 | |
| epoch 138 step 94440 | 458 batches | lr 0.000434 | ms/batch 5154.07 | loss 0.76 | bpc 1.10172 | |
| epoch 138 step 94450 | 468 batches | lr 0.000434 | ms/batch 5171.02 | loss 0.73 | bpc 1.04677 | |
| epoch 138 step 94460 | 478 batches | lr 0.000434 | ms/batch 5176.02 | loss 0.76 | bpc 1.09384 | |
| epoch 138 step 94470 | 488 batches | lr 0.000434 | ms/batch 5176.10 | loss 0.75 | bpc 1.08262 | |
| epoch 138 step 94480 | 498 batches | lr 0.000434 | ms/batch 5177.77 | loss 0.75 | bpc 1.07801 | |
| epoch 138 step 94490 | 508 batches | lr 0.000434 | ms/batch 5174.29 | loss 0.75 | bpc 1.08594 | |
| epoch 138 step 94500 | 518 batches | lr 0.000434 | ms/batch 5176.13 | loss 0.74 | bpc 1.07081 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 378 at step 94500 | time: 1360.97s | valid loss 0.75 | bpc 1.07687 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 138 step 94510 | 528 batches | lr 0.000434 | ms/batch 12001.55 | loss 0.77 | bpc 1.10757 | |
| epoch 138 step 94520 | 538 batches | lr 0.000434 | ms/batch 5180.03 | loss 0.79 | bpc 1.13825 | |
| epoch 138 step 94530 | 548 batches | lr 0.000434 | ms/batch 5168.14 | loss 0.73 | bpc 1.05871 | |
| epoch 138 step 94540 | 558 batches | lr 0.000434 | ms/batch 5164.24 | loss 0.76 | bpc 1.10223 | |
| epoch 138 step 94550 | 568 batches | lr 0.000434 | ms/batch 5163.19 | loss 0.77 | bpc 1.11740 | |
| epoch 138 step 94560 | 578 batches | lr 0.000434 | ms/batch 5155.72 | loss 0.73 | bpc 1.05834 | |
| epoch 138 step 94570 | 588 batches | lr 0.000434 | ms/batch 5166.50 | loss 0.75 | bpc 1.08571 | |
| epoch 138 step 94580 | 598 batches | lr 0.000434 | ms/batch 5174.21 | loss 0.76 | bpc 1.09762 | |
| epoch 138 step 94590 | 608 batches | lr 0.000434 | ms/batch 5177.91 | loss 0.74 | bpc 1.06610 | |
| epoch 138 step 94600 | 618 batches | lr 0.000434 | ms/batch 5189.60 | loss 0.75 | bpc 1.07733 | |
| epoch 138 step 94610 | 628 batches | lr 0.000434 | ms/batch 5178.79 | loss 0.77 | bpc 1.10469 | |
| epoch 138 step 94620 | 638 batches | lr 0.000434 | ms/batch 5181.69 | loss 0.74 | bpc 1.06901 | |
| epoch 138 step 94630 | 648 batches | lr 0.000434 | ms/batch 5136.32 | loss 0.79 | bpc 1.14431 | |
| epoch 138 step 94640 | 658 batches | lr 0.000434 | ms/batch 5132.51 | loss 0.77 | bpc 1.11659 | |
| epoch 138 step 94650 | 668 batches | lr 0.000434 | ms/batch 5140.15 | loss 0.77 | bpc 1.11276 | |
| epoch 138 step 94660 | 678 batches | lr 0.000434 | ms/batch 5175.81 | loss 0.70 | bpc 1.00657 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 139 step 94670 | 2 batches | lr 0.000434 | ms/batch 5188.81 | loss 0.70 | bpc 1.00286 | |
| epoch 139 step 94680 | 12 batches | lr 0.000434 | ms/batch 5124.77 | loss 0.74 | bpc 1.06950 | |
| epoch 139 step 94690 | 22 batches | lr 0.000434 | ms/batch 5122.41 | loss 0.77 | bpc 1.10895 | |
| epoch 139 step 94700 | 32 batches | lr 0.000434 | ms/batch 5121.74 | loss 0.73 | bpc 1.05845 | |
| epoch 139 step 94710 | 42 batches | lr 0.000434 | ms/batch 5130.95 | loss 0.72 | bpc 1.04099 | |
| epoch 139 step 94720 | 52 batches | lr 0.000434 | ms/batch 5124.04 | loss 0.76 | bpc 1.09174 | |
| epoch 139 step 94730 | 62 batches | lr 0.000434 | ms/batch 5124.90 | loss 0.76 | bpc 1.09084 | |
| epoch 139 step 94740 | 72 batches | lr 0.000434 | ms/batch 5125.21 | loss 0.76 | bpc 1.09041 | |
| epoch 139 step 94750 | 82 batches | lr 0.000434 | ms/batch 5126.28 | loss 0.70 | bpc 1.01592 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 379 at step 94750 | time: 1357.96s | valid loss 0.75 | bpc 1.07621 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 139 step 94760 | 92 batches | lr 0.000434 | ms/batch 11943.53 | loss 0.72 | bpc 1.04068 | |
| epoch 139 step 94770 | 102 batches | lr 0.000434 | ms/batch 5127.97 | loss 0.73 | bpc 1.05828 | |
| epoch 139 step 94780 | 112 batches | lr 0.000434 | ms/batch 5123.15 | loss 0.75 | bpc 1.08323 | |
| epoch 139 step 94790 | 122 batches | lr 0.000434 | ms/batch 5125.81 | loss 0.75 | bpc 1.08411 | |
| epoch 139 step 94800 | 132 batches | lr 0.000434 | ms/batch 5125.96 | loss 0.76 | bpc 1.09974 | |
| epoch 139 step 94810 | 142 batches | lr 0.000434 | ms/batch 5118.80 | loss 0.80 | bpc 1.15547 | |
| epoch 139 step 94820 | 152 batches | lr 0.000434 | ms/batch 5123.82 | loss 0.76 | bpc 1.09688 | |
| epoch 139 step 94830 | 162 batches | lr 0.000434 | ms/batch 5127.48 | loss 0.78 | bpc 1.13015 | |
| epoch 139 step 94840 | 172 batches | lr 0.000434 | ms/batch 5127.12 | loss 0.70 | bpc 1.01362 | |
| epoch 139 step 94850 | 182 batches | lr 0.000434 | ms/batch 5123.50 | loss 0.68 | bpc 0.97917 | |
| epoch 139 step 94860 | 192 batches | lr 0.000434 | ms/batch 5126.51 | loss 0.69 | bpc 0.98891 | |
| epoch 139 step 94870 | 202 batches | lr 0.000434 | ms/batch 5123.98 | loss 0.61 | bpc 0.88130 | |
| epoch 139 step 94880 | 212 batches | lr 0.000434 | ms/batch 5121.24 | loss 0.73 | bpc 1.05134 | |
| epoch 139 step 94890 | 222 batches | lr 0.000434 | ms/batch 5121.61 | loss 0.76 | bpc 1.09779 | |
| epoch 139 step 94900 | 232 batches | lr 0.000434 | ms/batch 5133.83 | loss 0.74 | bpc 1.06997 | |
| epoch 139 step 94910 | 242 batches | lr 0.000434 | ms/batch 5114.58 | loss 0.74 | bpc 1.07003 | |
| epoch 139 step 94920 | 252 batches | lr 0.000434 | ms/batch 5111.28 | loss 0.79 | bpc 1.14244 | |
| epoch 139 step 94930 | 262 batches | lr 0.000434 | ms/batch 5122.40 | loss 0.72 | bpc 1.03159 | |
| epoch 139 step 94940 | 272 batches | lr 0.000434 | ms/batch 5123.72 | loss 0.71 | bpc 1.01863 | |
| epoch 139 step 94950 | 282 batches | lr 0.000434 | ms/batch 5119.95 | loss 0.77 | bpc 1.10381 | |
| epoch 139 step 94960 | 292 batches | lr 0.000434 | ms/batch 5126.96 | loss 0.77 | bpc 1.10410 | |
| epoch 139 step 94970 | 302 batches | lr 0.000434 | ms/batch 5126.37 | loss 0.76 | bpc 1.09092 | |
| epoch 139 step 94980 | 312 batches | lr 0.000434 | ms/batch 5122.77 | loss 0.78 | bpc 1.12285 | |
| epoch 139 step 94990 | 322 batches | lr 0.000434 | ms/batch 5125.20 | loss 0.74 | bpc 1.06340 | |
| epoch 139 step 95000 | 332 batches | lr 0.000434 | ms/batch 5126.05 | loss 0.77 | bpc 1.10942 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 380 at step 95000 | time: 1349.20s | valid loss 0.74 | bpc 1.07174 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 139 step 95010 | 342 batches | lr 0.000434 | ms/batch 11936.92 | loss 0.70 | bpc 1.00856 | |
| epoch 139 step 95020 | 352 batches | lr 0.000434 | ms/batch 5126.52 | loss 0.77 | bpc 1.10700 | |
| epoch 139 step 95030 | 362 batches | lr 0.000434 | ms/batch 5124.78 | loss 0.77 | bpc 1.10818 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
| epoch 139 step 95040 | 372 batches | lr 0.000434 | ms/batch 5123.07 | loss 0.72 | bpc 1.03961 | |
| epoch 139 step 95050 | 382 batches | lr 0.000434 | ms/batch 5127.18 | loss 0.76 | bpc 1.09516 | |
| epoch 139 step 95060 | 392 batches | lr 0.000434 | ms/batch 5122.51 | loss 0.75 | bpc 1.08648 | |
| epoch 139 step 95070 | 402 batches | lr 0.000433 | ms/batch 5118.40 | loss 0.79 | bpc 1.13840 | |
| epoch 139 step 95080 | 412 batches | lr 0.000433 | ms/batch 5121.55 | loss 0.75 | bpc 1.07699 | |
| epoch 139 step 95090 | 422 batches | lr 0.000433 | ms/batch 5126.27 | loss 0.73 | bpc 1.05669 | |
| epoch 139 step 95100 | 432 batches | lr 0.000433 | ms/batch 5127.08 | loss 0.74 | bpc 1.07262 | |
| epoch 139 step 95110 | 442 batches | lr 0.000433 | ms/batch 5126.64 | loss 0.78 | bpc 1.11924 | |
| epoch 139 step 95120 | 452 batches | lr 0.000433 | ms/batch 5129.43 | loss 0.76 | bpc 1.09771 | |
| epoch 139 step 95130 | 462 batches | lr 0.000433 | ms/batch 5123.54 | loss 0.76 | bpc 1.08936 | |
| epoch 139 step 95140 | 472 batches | lr 0.000433 | ms/batch 5123.67 | loss 0.74 | bpc 1.06832 | |
| epoch 139 step 95150 | 482 batches | lr 0.000433 | ms/batch 5124.47 | loss 0.76 | bpc 1.09452 | |
| epoch 139 step 95160 | 492 batches | lr 0.000433 | ms/batch 5125.68 | loss 0.74 | bpc 1.07144 | |
| epoch 139 step 95170 | 502 batches | lr 0.000433 | ms/batch 5123.88 | loss 0.74 | bpc 1.06827 | |
| epoch 139 step 95180 | 512 batches | lr 0.000433 | ms/batch 5124.25 | loss 0.74 | bpc 1.07179 | |
| epoch 139 step 95190 | 522 batches | lr 0.000433 | ms/batch 5113.61 | loss 0.75 | bpc 1.08604 | |
| epoch 139 step 95200 | 532 batches | lr 0.000433 | ms/batch 5110.38 | loss 0.79 | bpc 1.14135 | |
| epoch 139 step 95210 | 542 batches | lr 0.000433 | ms/batch 5117.81 | loss 0.75 | bpc 1.08428 | |
| epoch 139 step 95220 | 552 batches | lr 0.000433 | ms/batch 5127.91 | loss 0.76 | bpc 1.08964 | |
| epoch 139 step 95230 | 562 batches | lr 0.000433 | ms/batch 5123.08 | loss 0.78 | bpc 1.12097 | |
| epoch 139 step 95240 | 572 batches | lr 0.000433 | ms/batch 5128.15 | loss 0.75 | bpc 1.08046 | |
| epoch 139 step 95250 | 582 batches | lr 0.000433 | ms/batch 5125.29 | loss 0.73 | bpc 1.05245 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 381 at step 95250 | time: 1349.11s | valid loss 0.75 | bpc 1.07664 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 139 step 95260 | 592 batches | lr 0.000433 | ms/batch 11933.65 | loss 0.77 | bpc 1.11679 | |
| epoch 139 step 95270 | 602 batches | lr 0.000433 | ms/batch 5125.01 | loss 0.74 | bpc 1.07227 | |
| epoch 139 step 95280 | 612 batches | lr 0.000433 | ms/batch 5122.39 | loss 0.72 | bpc 1.04350 | |
| epoch 139 step 95290 | 622 batches | lr 0.000433 | ms/batch 5128.57 | loss 0.77 | bpc 1.10949 | |
| epoch 139 step 95300 | 632 batches | lr 0.000433 | ms/batch 5126.67 | loss 0.76 | bpc 1.10094 | |
| epoch 139 step 95310 | 642 batches | lr 0.000433 | ms/batch 5124.12 | loss 0.76 | bpc 1.08982 | |
| epoch 139 step 95320 | 652 batches | lr 0.000433 | ms/batch 5122.40 | loss 0.79 | bpc 1.13868 | |
| epoch 139 step 95330 | 662 batches | lr 0.000433 | ms/batch 5116.28 | loss 0.78 | bpc 1.12439 | |
| epoch 139 step 95340 | 672 batches | lr 0.000433 | ms/batch 5127.17 | loss 0.72 | bpc 1.04112 | |
| epoch 139 step 95350 | 682 batches | lr 0.000433 | ms/batch 5124.27 | loss 0.71 | bpc 1.02218 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 140 step 95360 | 6 batches | lr 0.000433 | ms/batch 5124.32 | loss 0.71 | bpc 1.02453 | |
| epoch 140 step 95370 | 16 batches | lr 0.000433 | ms/batch 5119.99 | loss 0.74 | bpc 1.07471 | |
| epoch 140 step 95380 | 26 batches | lr 0.000433 | ms/batch 5124.90 | loss 0.78 | bpc 1.12215 | |
| epoch 140 step 95390 | 36 batches | lr 0.000433 | ms/batch 5124.08 | loss 0.71 | bpc 1.02018 | |
| epoch 140 step 95400 | 46 batches | lr 0.000433 | ms/batch 5117.27 | loss 0.74 | bpc 1.06449 | |
| epoch 140 step 95410 | 56 batches | lr 0.000433 | ms/batch 5123.31 | loss 0.77 | bpc 1.11520 | |
| epoch 140 step 95420 | 66 batches | lr 0.000433 | ms/batch 5123.28 | loss 0.76 | bpc 1.09756 | |
| epoch 140 step 95430 | 76 batches | lr 0.000433 | ms/batch 5125.95 | loss 0.72 | bpc 1.03453 | |
| epoch 140 step 95440 | 86 batches | lr 0.000433 | ms/batch 5122.74 | loss 0.72 | bpc 1.03616 | |
| epoch 140 step 95450 | 96 batches | lr 0.000433 | ms/batch 5125.27 | loss 0.72 | bpc 1.03370 | |
| epoch 140 step 95460 | 106 batches | lr 0.000433 | ms/batch 5120.50 | loss 0.75 | bpc 1.08859 | |
| epoch 140 step 95470 | 116 batches | lr 0.000433 | ms/batch 5119.06 | loss 0.74 | bpc 1.07450 | |
| epoch 140 step 95480 | 126 batches | lr 0.000433 | ms/batch 5128.59 | loss 0.76 | bpc 1.09167 | |
| epoch 140 step 95490 | 136 batches | lr 0.000433 | ms/batch 5126.28 | loss 0.77 | bpc 1.11717 | |
| epoch 140 step 95500 | 146 batches | lr 0.000433 | ms/batch 5111.41 | loss 0.77 | bpc 1.10991 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 382 at step 95500 | time: 1348.88s | valid loss 0.74 | bpc 1.07078 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 140 step 95510 | 156 batches | lr 0.000433 | ms/batch 11937.58 | loss 0.77 | bpc 1.11250 | |
| epoch 140 step 95520 | 166 batches | lr 0.000433 | ms/batch 5120.65 | loss 0.78 | bpc 1.12161 | |
| epoch 140 step 95530 | 176 batches | lr 0.000433 | ms/batch 5127.33 | loss 0.68 | bpc 0.98176 | |
| epoch 140 step 95540 | 186 batches | lr 0.000433 | ms/batch 5121.15 | loss 0.68 | bpc 0.98519 | |
| epoch 140 step 95550 | 196 batches | lr 0.000433 | ms/batch 5124.66 | loss 0.66 | bpc 0.95470 | |
| epoch 140 step 95560 | 206 batches | lr 0.000433 | ms/batch 5125.40 | loss 0.65 | bpc 0.93864 | |
| epoch 140 step 95570 | 216 batches | lr 0.000433 | ms/batch 5126.40 | loss 0.75 | bpc 1.08320 | |
| epoch 140 step 95580 | 226 batches | lr 0.000433 | ms/batch 5119.82 | loss 0.74 | bpc 1.06770 | |
| epoch 140 step 95590 | 236 batches | lr 0.000433 | ms/batch 5120.07 | loss 0.74 | bpc 1.06156 | |
| epoch 140 step 95600 | 246 batches | lr 0.000433 | ms/batch 5124.59 | loss 0.78 | bpc 1.12296 | |
| epoch 140 step 95610 | 256 batches | lr 0.000433 | ms/batch 5128.45 | loss 0.75 | bpc 1.07933 | |
| epoch 140 step 95620 | 266 batches | lr 0.000433 | ms/batch 5125.60 | loss 0.70 | bpc 1.01103 | |
| epoch 140 step 95630 | 276 batches | lr 0.000433 | ms/batch 5123.98 | loss 0.74 | bpc 1.06968 | |
| epoch 140 step 95640 | 286 batches | lr 0.000433 | ms/batch 5125.66 | loss 0.79 | bpc 1.13333 | |
| epoch 140 step 95650 | 296 batches | lr 0.000433 | ms/batch 5119.11 | loss 0.74 | bpc 1.07271 | |
| epoch 140 step 95660 | 306 batches | lr 0.000433 | ms/batch 5122.45 | loss 0.75 | bpc 1.08275 | |
| epoch 140 step 95670 | 316 batches | lr 0.000433 | ms/batch 5126.38 | loss 0.78 | bpc 1.11849 | |
| epoch 140 step 95680 | 326 batches | lr 0.000433 | ms/batch 5127.75 | loss 0.75 | bpc 1.07864 | |
| epoch 140 step 95690 | 336 batches | lr 0.000433 | ms/batch 5123.21 | loss 0.75 | bpc 1.07753 | |
| epoch 140 step 95700 | 346 batches | lr 0.000433 | ms/batch 5123.39 | loss 0.72 | bpc 1.03681 | |
| epoch 140 step 95710 | 356 batches | lr 0.000433 | ms/batch 5127.67 | loss 0.78 | bpc 1.12506 | |
| epoch 140 step 95720 | 366 batches | lr 0.000433 | ms/batch 5117.95 | loss 0.74 | bpc 1.06681 | |
| epoch 140 step 95730 | 376 batches | lr 0.000433 | ms/batch 5122.99 | loss 0.73 | bpc 1.05486 | |
| epoch 140 step 95740 | 386 batches | lr 0.000433 | ms/batch 5126.79 | loss 0.77 | bpc 1.11117 | |
| epoch 140 step 95750 | 396 batches | lr 0.000433 | ms/batch 5128.96 | loss 0.76 | bpc 1.09738 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 383 at step 95750 | time: 1349.29s | valid loss 0.74 | bpc 1.07126 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 140 step 95760 | 406 batches | lr 0.000433 | ms/batch 11941.76 | loss 0.77 | bpc 1.10719 | |
| epoch 140 step 95770 | 416 batches | lr 0.000433 | ms/batch 5125.06 | loss 0.75 | bpc 1.08093 | |
| epoch 140 step 95780 | 426 batches | lr 0.000433 | ms/batch 5124.10 | loss 0.72 | bpc 1.03321 | |
| epoch 140 step 95790 | 436 batches | lr 0.000433 | ms/batch 5124.10 | loss 0.78 | bpc 1.12472 | |
| epoch 140 step 95800 | 446 batches | lr 0.000433 | ms/batch 5127.90 | loss 0.76 | bpc 1.09225 | |
| epoch 140 step 95810 | 456 batches | lr 0.000432 | ms/batch 5123.80 | loss 0.76 | bpc 1.09098 | |
| epoch 140 step 95820 | 466 batches | lr 0.000432 | ms/batch 5110.90 | loss 0.75 | bpc 1.07711 | |
| epoch 140 step 95830 | 476 batches | lr 0.000432 | ms/batch 5107.94 | loss 0.74 | bpc 1.06761 | |
| epoch 140 step 95840 | 486 batches | lr 0.000432 | ms/batch 5115.61 | loss 0.76 | bpc 1.09537 | |
| epoch 140 step 95850 | 496 batches | lr 0.000432 | ms/batch 5125.26 | loss 0.75 | bpc 1.08684 | |
| epoch 140 step 95860 | 506 batches | lr 0.000432 | ms/batch 5130.22 | loss 0.74 | bpc 1.06121 | |
| epoch 140 step 95870 | 516 batches | lr 0.000432 | ms/batch 5125.69 | loss 0.75 | bpc 1.08475 | |
| epoch 140 step 95880 | 526 batches | lr 0.000432 | ms/batch 5127.44 | loss 0.76 | bpc 1.09670 | |
| epoch 140 step 95890 | 536 batches | lr 0.000432 | ms/batch 5126.51 | loss 0.79 | bpc 1.13858 | |
| epoch 140 step 95900 | 546 batches | lr 0.000432 | ms/batch 5126.05 | loss 0.74 | bpc 1.06818 | |
| epoch 140 step 95910 | 556 batches | lr 0.000432 | ms/batch 5120.81 | loss 0.77 | bpc 1.10715 | |
| epoch 140 step 95920 | 566 batches | lr 0.000432 | ms/batch 5126.88 | loss 0.78 | bpc 1.11989 | |
| epoch 140 step 95930 | 576 batches | lr 0.000432 | ms/batch 5127.51 | loss 0.74 | bpc 1.06395 | |
| epoch 140 step 95940 | 586 batches | lr 0.000432 | ms/batch 5124.65 | loss 0.74 | bpc 1.07180 | |
| epoch 140 step 95950 | 596 batches | lr 0.000432 | ms/batch 5125.97 | loss 0.77 | bpc 1.11154 | |
| epoch 140 step 95960 | 606 batches | lr 0.000432 | ms/batch 5125.59 | loss 0.74 | bpc 1.06412 | |
| epoch 140 step 95970 | 616 batches | lr 0.000432 | ms/batch 5124.64 | loss 0.74 | bpc 1.06261 | |
| epoch 140 step 95980 | 626 batches | lr 0.000432 | ms/batch 5119.59 | loss 0.76 | bpc 1.09283 | |
| epoch 140 step 95990 | 636 batches | lr 0.000432 | ms/batch 5124.83 | loss 0.77 | bpc 1.10548 | |
| epoch 140 step 96000 | 646 batches | lr 0.000432 | ms/batch 5125.79 | loss 0.76 | bpc 1.09583 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 384 at step 96000 | time: 1352.32s | valid loss 0.74 | bpc 1.07481 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 140 step 96010 | 656 batches | lr 0.000432 | ms/batch 11945.96 | loss 0.79 | bpc 1.13688 | |
| epoch 140 step 96020 | 666 batches | lr 0.000432 | ms/batch 5127.74 | loss 0.78 | bpc 1.12369 | |
| epoch 140 step 96030 | 676 batches | lr 0.000432 | ms/batch 5120.58 | loss 0.70 | bpc 1.00978 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
| epoch 140 step 96040 | 686 batches | lr 0.000432 | ms/batch 5121.95 | loss 0.69 | bpc 1.00215 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 141 step 96050 | 10 batches | lr 0.000432 | ms/batch 5127.71 | loss 0.73 | bpc 1.05169 | |
| epoch 141 step 96060 | 20 batches | lr 0.000432 | ms/batch 5125.65 | loss 0.76 | bpc 1.09595 | |
| epoch 141 step 96070 | 30 batches | lr 0.000432 | ms/batch 5126.96 | loss 0.75 | bpc 1.08467 | |
| epoch 141 step 96080 | 40 batches | lr 0.000432 | ms/batch 5126.53 | loss 0.71 | bpc 1.03144 | |
| epoch 141 step 96090 | 50 batches | lr 0.000432 | ms/batch 5124.73 | loss 0.76 | bpc 1.09211 | |
| epoch 141 step 96100 | 60 batches | lr 0.000432 | ms/batch 5127.60 | loss 0.77 | bpc 1.10513 | |
| epoch 141 step 96110 | 70 batches | lr 0.000432 | ms/batch 5125.50 | loss 0.74 | bpc 1.07240 | |
| epoch 141 step 96120 | 80 batches | lr 0.000432 | ms/batch 5126.85 | loss 0.71 | bpc 1.01875 | |
| epoch 141 step 96130 | 90 batches | lr 0.000432 | ms/batch 5125.44 | loss 0.73 | bpc 1.05720 | |
| epoch 141 step 96140 | 100 batches | lr 0.000432 | ms/batch 5126.24 | loss 0.71 | bpc 1.02014 | |
| epoch 141 step 96150 | 110 batches | lr 0.000432 | ms/batch 5125.54 | loss 0.77 | bpc 1.10669 | |
| epoch 141 step 96160 | 120 batches | lr 0.000432 | ms/batch 5123.47 | loss 0.74 | bpc 1.06048 | |
| epoch 141 step 96170 | 130 batches | lr 0.000432 | ms/batch 5127.32 | loss 0.78 | bpc 1.12161 | |
| epoch 141 step 96180 | 140 batches | lr 0.000432 | ms/batch 5111.52 | loss 0.79 | bpc 1.14322 | |
| epoch 141 step 96190 | 150 batches | lr 0.000432 | ms/batch 5112.34 | loss 0.75 | bpc 1.08179 | |
| epoch 141 step 96200 | 160 batches | lr 0.000432 | ms/batch 5112.10 | loss 0.79 | bpc 1.13382 | |
| epoch 141 step 96210 | 170 batches | lr 0.000432 | ms/batch 5121.98 | loss 0.73 | bpc 1.05524 | |
| epoch 141 step 96220 | 180 batches | lr 0.000432 | ms/batch 5124.73 | loss 0.68 | bpc 0.97462 | |
| epoch 141 step 96230 | 190 batches | lr 0.000432 | ms/batch 5124.47 | loss 0.67 | bpc 0.97335 | |
| epoch 141 step 96240 | 200 batches | lr 0.000432 | ms/batch 5125.51 | loss 0.63 | bpc 0.91298 | |
| epoch 141 step 96250 | 210 batches | lr 0.000432 | ms/batch 5127.39 | loss 0.70 | bpc 1.01110 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 385 at step 96250 | time: 1349.16s | valid loss 0.74 | bpc 1.07436 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 141 step 96260 | 220 batches | lr 0.000432 | ms/batch 11937.33 | loss 0.77 | bpc 1.10689 | |
| epoch 141 step 96270 | 230 batches | lr 0.000432 | ms/batch 5122.92 | loss 0.74 | bpc 1.06636 | |
| epoch 141 step 96280 | 240 batches | lr 0.000432 | ms/batch 5128.02 | loss 0.74 | bpc 1.06648 | |
| epoch 141 step 96290 | 250 batches | lr 0.000432 | ms/batch 5125.68 | loss 0.78 | bpc 1.13112 | |
| epoch 141 step 96300 | 260 batches | lr 0.000432 | ms/batch 5126.58 | loss 0.72 | bpc 1.04356 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
| epoch 141 step 96310 | 270 batches | lr 0.000432 | ms/batch 5123.95 | loss 0.71 | bpc 1.01865 | |
| epoch 141 step 96320 | 280 batches | lr 0.000432 | ms/batch 5128.07 | loss 0.75 | bpc 1.08395 | |
| epoch 141 step 96330 | 290 batches | lr 0.000432 | ms/batch 5121.16 | loss 0.78 | bpc 1.12022 | |
| epoch 141 step 96340 | 300 batches | lr 0.000432 | ms/batch 5125.08 | loss 0.75 | bpc 1.08346 | |
| epoch 141 step 96350 | 310 batches | lr 0.000432 | ms/batch 5128.04 | loss 0.77 | bpc 1.11117 | |
| epoch 141 step 96360 | 320 batches | lr 0.000432 | ms/batch 5127.93 | loss 0.73 | bpc 1.05946 | |
| epoch 141 step 96370 | 330 batches | lr 0.000432 | ms/batch 5128.05 | loss 0.76 | bpc 1.09847 | |
| epoch 141 step 96380 | 340 batches | lr 0.000432 | ms/batch 5128.05 | loss 0.72 | bpc 1.04358 | |
| epoch 141 step 96390 | 350 batches | lr 0.000432 | ms/batch 5128.57 | loss 0.74 | bpc 1.07142 | |
| epoch 141 step 96400 | 360 batches | lr 0.000432 | ms/batch 5118.73 | loss 0.79 | bpc 1.13636 | |
| epoch 141 step 96410 | 370 batches | lr 0.000432 | ms/batch 5126.20 | loss 0.73 | bpc 1.05501 | |
| epoch 141 step 96420 | 380 batches | lr 0.000432 | ms/batch 5125.43 | loss 0.75 | bpc 1.07943 | |
| epoch 141 step 96430 | 390 batches | lr 0.000432 | ms/batch 5125.88 | loss 0.75 | bpc 1.07834 | |
| epoch 141 step 96440 | 400 batches | lr 0.000432 | ms/batch 5129.00 | loss 0.79 | bpc 1.13689 | |
| epoch 141 step 96450 | 410 batches | lr 0.000432 | ms/batch 5126.49 | loss 0.76 | bpc 1.09376 | |
| epoch 141 step 96460 | 420 batches | lr 0.000432 | ms/batch 5126.84 | loss 0.73 | bpc 1.05081 | |
| epoch 141 step 96470 | 430 batches | lr 0.000432 | ms/batch 5124.50 | loss 0.74 | bpc 1.06219 | |
| epoch 141 step 96480 | 440 batches | lr 0.000432 | ms/batch 5126.20 | loss 0.77 | bpc 1.11118 | |
| epoch 141 step 96490 | 450 batches | lr 0.000432 | ms/batch 5127.78 | loss 0.76 | bpc 1.09711 | |
| epoch 141 step 96500 | 460 batches | lr 0.000432 | ms/batch 5128.34 | loss 0.76 | bpc 1.09085 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 386 at step 96500 | time: 1349.77s | valid loss 0.74 | bpc 1.07314 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 141 step 96510 | 470 batches | lr 0.000432 | ms/batch 11944.34 | loss 0.73 | bpc 1.05398 | |
| epoch 141 step 96520 | 480 batches | lr 0.000432 | ms/batch 5126.12 | loss 0.77 | bpc 1.10448 | |
| epoch 141 step 96530 | 490 batches | lr 0.000432 | ms/batch 5129.32 | loss 0.74 | bpc 1.07203 | |
| epoch 141 step 96540 | 500 batches | lr 0.000432 | ms/batch 5129.17 | loss 0.74 | bpc 1.06889 | |
| epoch 141 step 96550 | 510 batches | lr 0.000432 | ms/batch 5125.62 | loss 0.75 | bpc 1.07538 | |
| epoch 141 step 96560 | 520 batches | lr 0.000431 | ms/batch 5112.83 | loss 0.76 | bpc 1.09061 | |
| epoch 141 step 96570 | 530 batches | lr 0.000431 | ms/batch 5111.93 | loss 0.77 | bpc 1.11201 | |
| epoch 141 step 96580 | 540 batches | lr 0.000431 | ms/batch 5110.38 | loss 0.78 | bpc 1.12564 | |
| epoch 141 step 96590 | 550 batches | lr 0.000431 | ms/batch 5122.44 | loss 0.74 | bpc 1.06404 | |
| epoch 141 step 96600 | 560 batches | lr 0.000431 | ms/batch 5131.34 | loss 0.77 | bpc 1.10997 | |
| epoch 141 step 96610 | 570 batches | lr 0.000431 | ms/batch 5129.31 | loss 0.76 | bpc 1.09649 | |
| epoch 141 step 96620 | 580 batches | lr 0.000431 | ms/batch 5128.21 | loss 0.74 | bpc 1.06629 | |
| epoch 141 step 96630 | 590 batches | lr 0.000431 | ms/batch 5126.94 | loss 0.75 | bpc 1.08347 | |
| epoch 141 step 96640 | 600 batches | lr 0.000431 | ms/batch 5125.07 | loss 0.75 | bpc 1.08675 | |
| epoch 141 step 96650 | 610 batches | lr 0.000431 | ms/batch 5130.56 | loss 0.73 | bpc 1.05618 | |
| epoch 141 step 96660 | 620 batches | lr 0.000431 | ms/batch 5128.29 | loss 0.75 | bpc 1.08803 | |
| epoch 141 step 96670 | 630 batches | lr 0.000431 | ms/batch 5130.34 | loss 0.77 | bpc 1.10551 | |
| epoch 141 step 96680 | 640 batches | lr 0.000431 | ms/batch 5127.48 | loss 0.75 | bpc 1.08583 | |
| epoch 141 step 96690 | 650 batches | lr 0.000431 | ms/batch 5127.99 | loss 0.79 | bpc 1.14271 | |
| epoch 141 step 96700 | 660 batches | lr 0.000431 | ms/batch 5123.00 | loss 0.77 | bpc 1.11342 | |
| epoch 141 step 96710 | 670 batches | lr 0.000431 | ms/batch 5124.72 | loss 0.75 | bpc 1.08183 | |
| epoch 141 step 96720 | 680 batches | lr 0.000431 | ms/batch 5127.77 | loss 0.70 | bpc 1.01069 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 142 step 96730 | 4 batches | lr 0.000431 | ms/batch 5130.71 | loss 0.70 | bpc 1.00781 | |
| epoch 142 step 96740 | 14 batches | lr 0.000431 | ms/batch 5127.98 | loss 0.74 | bpc 1.06889 | |
| epoch 142 step 96750 | 24 batches | lr 0.000431 | ms/batch 5126.93 | loss 0.77 | bpc 1.11191 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 387 at step 96750 | time: 1357.63s | valid loss 0.74 | bpc 1.07156 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 142 step 96760 | 34 batches | lr 0.000431 | ms/batch 11942.42 | loss 0.73 | bpc 1.05025 | |
| epoch 142 step 96770 | 44 batches | lr 0.000431 | ms/batch 5130.72 | loss 0.72 | bpc 1.04545 | |
| epoch 142 step 96780 | 54 batches | lr 0.000431 | ms/batch 5132.15 | loss 0.77 | bpc 1.10910 | |
| epoch 142 step 96790 | 64 batches | lr 0.000431 | ms/batch 5126.77 | loss 0.75 | bpc 1.08379 | |
| epoch 142 step 96800 | 74 batches | lr 0.000431 | ms/batch 5128.39 | loss 0.74 | bpc 1.06948 | |
| epoch 142 step 96810 | 84 batches | lr 0.000431 | ms/batch 5131.17 | loss 0.70 | bpc 1.01414 | |
| epoch 142 step 96820 | 94 batches | lr 0.000431 | ms/batch 5132.47 | loss 0.72 | bpc 1.04163 | |
| epoch 142 step 96830 | 104 batches | lr 0.000431 | ms/batch 5129.43 | loss 0.74 | bpc 1.07443 | |
| epoch 142 step 96840 | 114 batches | lr 0.000431 | ms/batch 5129.54 | loss 0.75 | bpc 1.07660 | |
| epoch 142 step 96850 | 124 batches | lr 0.000431 | ms/batch 5129.20 | loss 0.76 | bpc 1.09513 | |
| epoch 142 step 96860 | 134 batches | lr 0.000431 | ms/batch 5131.29 | loss 0.76 | bpc 1.09823 | |
| epoch 142 step 96870 | 144 batches | lr 0.000431 | ms/batch 5129.92 | loss 0.78 | bpc 1.13223 | |
| epoch 142 step 96880 | 154 batches | lr 0.000431 | ms/batch 5124.83 | loss 0.76 | bpc 1.10074 | |
| epoch 142 step 96890 | 164 batches | lr 0.000431 | ms/batch 5128.38 | loss 0.79 | bpc 1.13790 | |
| epoch 142 step 96900 | 174 batches | lr 0.000431 | ms/batch 5129.30 | loss 0.68 | bpc 0.98089 | |
| epoch 142 step 96910 | 184 batches | lr 0.000431 | ms/batch 5129.64 | loss 0.68 | bpc 0.98558 | |
| epoch 142 step 96920 | 194 batches | lr 0.000431 | ms/batch 5128.70 | loss 0.69 | bpc 0.99378 | |
| epoch 142 step 96930 | 204 batches | lr 0.000431 | ms/batch 5126.87 | loss 0.62 | bpc 0.89657 | |
| epoch 142 step 96940 | 214 batches | lr 0.000431 | ms/batch 5130.78 | loss 0.73 | bpc 1.05288 | |
| epoch 142 step 96950 | 224 batches | lr 0.000431 | ms/batch 5127.45 | loss 0.74 | bpc 1.07466 | |
| epoch 142 step 96960 | 234 batches | lr 0.000431 | ms/batch 5125.65 | loss 0.74 | bpc 1.06656 | |
| epoch 142 step 96970 | 244 batches | lr 0.000431 | ms/batch 5132.50 | loss 0.76 | bpc 1.09150 | |
| epoch 142 step 96980 | 254 batches | lr 0.000431 | ms/batch 5130.09 | loss 0.78 | bpc 1.12462 | |
| epoch 142 step 96990 | 264 batches | lr 0.000431 | ms/batch 5125.75 | loss 0.70 | bpc 1.01548 | |
| epoch 142 step 97000 | 274 batches | lr 0.000431 | ms/batch 5112.42 | loss 0.72 | bpc 1.04424 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 388 at step 97000 | time: 1350.38s | valid loss 0.74 | bpc 1.06756 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 142 step 97010 | 284 batches | lr 0.000431 | ms/batch 11923.99 | loss 0.77 | bpc 1.10858 | |
| epoch 142 step 97020 | 294 batches | lr 0.000431 | ms/batch 5131.61 | loss 0.76 | bpc 1.09604 | |
| epoch 142 step 97030 | 304 batches | lr 0.000431 | ms/batch 5130.64 | loss 0.76 | bpc 1.09196 | |
| epoch 142 step 97040 | 314 batches | lr 0.000431 | ms/batch 5134.65 | loss 0.77 | bpc 1.10711 | |
| epoch 142 step 97050 | 324 batches | lr 0.000431 | ms/batch 5130.50 | loss 0.75 | bpc 1.07969 | |
| epoch 142 step 97060 | 334 batches | lr 0.000431 | ms/batch 5130.67 | loss 0.76 | bpc 1.09259 | |
| epoch 142 step 97070 | 344 batches | lr 0.000431 | ms/batch 5119.63 | loss 0.71 | bpc 1.02176 | |
| epoch 142 step 97080 | 354 batches | lr 0.000431 | ms/batch 5128.96 | loss 0.78 | bpc 1.12317 | |
| epoch 142 step 97090 | 364 batches | lr 0.000431 | ms/batch 5128.94 | loss 0.75 | bpc 1.07901 | |
| epoch 142 step 97100 | 374 batches | lr 0.000431 | ms/batch 5131.16 | loss 0.72 | bpc 1.04381 | |
| epoch 142 step 97110 | 384 batches | lr 0.000431 | ms/batch 5132.30 | loss 0.76 | bpc 1.10097 | |
| epoch 142 step 97120 | 394 batches | lr 0.000431 | ms/batch 5132.85 | loss 0.75 | bpc 1.08669 | |
| epoch 142 step 97130 | 404 batches | lr 0.000431 | ms/batch 5129.23 | loss 0.77 | bpc 1.11108 | |
| epoch 142 step 97140 | 414 batches | lr 0.000431 | ms/batch 5125.17 | loss 0.76 | bpc 1.09074 | |
| epoch 142 step 97150 | 424 batches | lr 0.000431 | ms/batch 5130.87 | loss 0.72 | bpc 1.04130 | |
| epoch 142 step 97160 | 434 batches | lr 0.000431 | ms/batch 5129.74 | loss 0.76 | bpc 1.09622 | |
| epoch 142 step 97170 | 444 batches | lr 0.000431 | ms/batch 5131.36 | loss 0.76 | bpc 1.09815 | |
| epoch 142 step 97180 | 454 batches | lr 0.000431 | ms/batch 5132.76 | loss 0.77 | bpc 1.10811 | |
| epoch 142 step 97190 | 464 batches | lr 0.000431 | ms/batch 5130.68 | loss 0.75 | bpc 1.07853 | |
| epoch 142 step 97200 | 474 batches | lr 0.000431 | ms/batch 5130.55 | loss 0.74 | bpc 1.06056 | |
| epoch 142 step 97210 | 484 batches | lr 0.000431 | ms/batch 5127.79 | loss 0.75 | bpc 1.08760 | |
| epoch 142 step 97220 | 494 batches | lr 0.000431 | ms/batch 5132.59 | loss 0.76 | bpc 1.09304 | |
| epoch 142 step 97230 | 504 batches | lr 0.000431 | ms/batch 5133.11 | loss 0.73 | bpc 1.05542 | |
| epoch 142 step 97240 | 514 batches | lr 0.000431 | ms/batch 5133.62 | loss 0.74 | bpc 1.06759 | |
| epoch 142 step 97250 | 524 batches | lr 0.000431 | ms/batch 5131.22 | loss 0.75 | bpc 1.08336 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 389 at step 97250 | time: 1355.96s | valid loss 0.74 | bpc 1.06993 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 142 step 97260 | 534 batches | lr 0.000431 | ms/batch 11944.49 | loss 0.80 | bpc 1.14809 | |
| epoch 142 step 97270 | 544 batches | lr 0.000431 | ms/batch 5131.13 | loss 0.74 | bpc 1.07420 | |
| epoch 142 step 97280 | 554 batches | lr 0.000431 | ms/batch 5129.75 | loss 0.76 | bpc 1.09996 | |
| epoch 142 step 97290 | 564 batches | lr 0.00043 | ms/batch 5131.24 | loss 0.77 | bpc 1.11402 | |
| epoch 142 step 97300 | 574 batches | lr 0.00043 | ms/batch 5131.62 | loss 0.74 | bpc 1.07032 | |
| epoch 142 step 97310 | 584 batches | lr 0.00043 | ms/batch 5129.91 | loss 0.74 | bpc 1.06349 | |
| epoch 142 step 97320 | 594 batches | lr 0.00043 | ms/batch 5131.71 | loss 0.77 | bpc 1.10582 | |
| epoch 142 step 97330 | 604 batches | lr 0.00043 | ms/batch 5129.56 | loss 0.75 | bpc 1.08081 | |
| epoch 142 step 97340 | 614 batches | lr 0.00043 | ms/batch 5129.16 | loss 0.72 | bpc 1.04492 | |
| epoch 142 step 97350 | 624 batches | lr 0.00043 | ms/batch 5130.03 | loss 0.76 | bpc 1.09383 | |
| epoch 142 step 97360 | 634 batches | lr 0.00043 | ms/batch 5129.87 | loss 0.77 | bpc 1.11037 | |
| epoch 142 step 97370 | 644 batches | lr 0.00043 | ms/batch 5135.54 | loss 0.75 | bpc 1.08050 | |
| epoch 142 step 97380 | 654 batches | lr 0.00043 | ms/batch 5132.89 | loss 0.80 | bpc 1.15630 | |
| epoch 142 step 97390 | 664 batches | lr 0.00043 | ms/batch 5129.43 | loss 0.77 | bpc 1.11663 | |
| epoch 142 step 97400 | 674 batches | lr 0.00043 | ms/batch 5130.24 | loss 0.71 | bpc 1.02130 | |
| epoch 142 step 97410 | 684 batches | lr 0.00043 | ms/batch 5130.71 | loss 0.70 | bpc 1.01013 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 143 step 97420 | 8 batches | lr 0.00043 | ms/batch 5133.54 | loss 0.72 | bpc 1.03918 | |
| epoch 143 step 97430 | 18 batches | lr 0.00043 | ms/batch 5128.15 | loss 0.75 | bpc 1.08081 | |
| epoch 143 step 97440 | 28 batches | lr 0.00043 | ms/batch 5128.76 | loss 0.76 | bpc 1.10207 | |
| epoch 143 step 97450 | 38 batches | lr 0.00043 | ms/batch 5131.83 | loss 0.72 | bpc 1.03213 | |
| epoch 143 step 97460 | 48 batches | lr 0.00043 | ms/batch 5108.71 | loss 0.74 | bpc 1.06229 | |
| epoch 143 step 97470 | 58 batches | lr 0.00043 | ms/batch 5110.69 | loss 0.77 | bpc 1.10747 | |
| epoch 143 step 97480 | 68 batches | lr 0.00043 | ms/batch 5111.84 | loss 0.76 | bpc 1.09198 | |
| epoch 143 step 97490 | 78 batches | lr 0.00043 | ms/batch 5111.86 | loss 0.71 | bpc 1.02388 | |
| epoch 143 step 97500 | 88 batches | lr 0.00043 | ms/batch 5132.47 | loss 0.72 | bpc 1.03949 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 390 at step 97500 | time: 1358.41s | valid loss 0.74 | bpc 1.07460 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 143 step 97510 | 98 batches | lr 0.00043 | ms/batch 11946.92 | loss 0.71 | bpc 1.02130 | |
| epoch 143 step 97520 | 108 batches | lr 0.00043 | ms/batch 5131.05 | loss 0.76 | bpc 1.10364 | |
| epoch 143 step 97530 | 118 batches | lr 0.00043 | ms/batch 5130.62 | loss 0.74 | bpc 1.06718 | |
| epoch 143 step 97540 | 128 batches | lr 0.00043 | ms/batch 5127.51 | loss 0.77 | bpc 1.11463 | |
| epoch 143 step 97550 | 138 batches | lr 0.00043 | ms/batch 5131.10 | loss 0.78 | bpc 1.12757 | |
| epoch 143 step 97560 | 148 batches | lr 0.00043 | ms/batch 5131.22 | loss 0.75 | bpc 1.07567 | |
| epoch 143 step 97570 | 158 batches | lr 0.00043 | ms/batch 5127.67 | loss 0.79 | bpc 1.13340 | |
| epoch 143 step 97580 | 168 batches | lr 0.00043 | ms/batch 5130.44 | loss 0.75 | bpc 1.08599 | |
| epoch 143 step 97590 | 178 batches | lr 0.00043 | ms/batch 5131.15 | loss 0.67 | bpc 0.96969 | |
| epoch 143 step 97600 | 188 batches | lr 0.00043 | ms/batch 5134.45 | loss 0.68 | bpc 0.97862 | |
| epoch 143 step 97610 | 198 batches | lr 0.00043 | ms/batch 5129.46 | loss 0.65 | bpc 0.93874 | |
| epoch 143 step 97620 | 208 batches | lr 0.00043 | ms/batch 5130.38 | loss 0.67 | bpc 0.96848 | |
| epoch 143 step 97630 | 218 batches | lr 0.00043 | ms/batch 5130.21 | loss 0.76 | bpc 1.09128 | |
| epoch 143 step 97640 | 228 batches | lr 0.00043 | ms/batch 5129.01 | loss 0.74 | bpc 1.06886 | |
| epoch 143 step 97650 | 238 batches | lr 0.00043 | ms/batch 5130.57 | loss 0.73 | bpc 1.06032 | |
| epoch 143 step 97660 | 248 batches | lr 0.00043 | ms/batch 5132.71 | loss 0.78 | bpc 1.12229 | |
| epoch 143 step 97670 | 258 batches | lr 0.00043 | ms/batch 5130.19 | loss 0.74 | bpc 1.07409 | |
| epoch 143 step 97680 | 268 batches | lr 0.00043 | ms/batch 5129.78 | loss 0.70 | bpc 1.00512 | |
| epoch 143 step 97690 | 278 batches | lr 0.00043 | ms/batch 5124.13 | loss 0.75 | bpc 1.07490 | |
| epoch 143 step 97700 | 288 batches | lr 0.00043 | ms/batch 5127.54 | loss 0.78 | bpc 1.12796 | |
| epoch 143 step 97710 | 298 batches | lr 0.00043 | ms/batch 5128.96 | loss 0.75 | bpc 1.07857 | |
| epoch 143 step 97720 | 308 batches | lr 0.00043 | ms/batch 5135.69 | loss 0.76 | bpc 1.09333 | |
| epoch 143 step 97730 | 318 batches | lr 0.00043 | ms/batch 5131.11 | loss 0.76 | bpc 1.09119 | |
| epoch 143 step 97740 | 328 batches | lr 0.00043 | ms/batch 5131.39 | loss 0.76 | bpc 1.09129 | |
| epoch 143 step 97750 | 338 batches | lr 0.00043 | ms/batch 5127.77 | loss 0.73 | bpc 1.05140 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 391 at step 97750 | time: 1356.46s | valid loss 0.74 | bpc 1.07408 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 143 step 97760 | 348 batches | lr 0.00043 | ms/batch 11956.39 | loss 0.72 | bpc 1.03965 | |
| epoch 143 step 97770 | 358 batches | lr 0.00043 | ms/batch 5128.54 | loss 0.79 | bpc 1.14297 | |
| epoch 143 step 97780 | 368 batches | lr 0.00043 | ms/batch 5134.27 | loss 0.74 | bpc 1.06666 | |
| epoch 143 step 97790 | 378 batches | lr 0.00043 | ms/batch 5132.63 | loss 0.74 | bpc 1.06350 | |
| epoch 143 step 97800 | 388 batches | lr 0.00043 | ms/batch 5129.73 | loss 0.77 | bpc 1.10927 | |
| epoch 143 step 97810 | 398 batches | lr 0.00043 | ms/batch 5124.85 | loss 0.77 | bpc 1.10378 | |
| epoch 143 step 97820 | 408 batches | lr 0.00043 | ms/batch 5126.05 | loss 0.77 | bpc 1.10608 | |
| epoch 143 step 97830 | 418 batches | lr 0.00043 | ms/batch 5132.46 | loss 0.73 | bpc 1.05912 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
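
# Note (added): the four "Gradient overflow" lines (one per worker) are
# Apex-style dynamic loss scaling. A minimal sketch of the behaviour, NOT
# Apex's actual implementation: on inf/nan gradients the optimizer step is
# skipped and the scale halves (here 1048576 -> 524288, i.e. 2^20 -> 2^19);
# after a run of clean steps it grows back, which is why later overflows in
# this log land at 262144 again.
#
#   import torch
#
#   class DynamicLossScaler:
#       def __init__(self, init_scale=2.0 ** 20, growth_interval=2000):
#           self.scale = init_scale
#           self.growth_interval = growth_interval
#           self._clean_steps = 0
#
#       def scale_loss(self, loss):
#           return loss * self.scale            # scaled before backward()
#
#       def step(self, optimizer, parameters):
#           grads = [p.grad for p in parameters if p.grad is not None]
#           if any(not torch.isfinite(g).all() for g in grads):
#               self.scale /= 2                 # "reducing loss scale to ..."
#               self._clean_steps = 0
#               return False                    # "Skipping step"
#           for g in grads:
#               g.div_(self.scale)              # unscale before the update
#           optimizer.step()
#           self._clean_steps += 1
#           if self._clean_steps % self.growth_interval == 0:
#               self.scale *= 2                 # recover precision over time
#           return True
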
| epoch 143 step 97840 | 428 batches | lr 0.00043 | ms/batch 5127.82 | loss 0.73 | bpc 1.04711 | |
| epoch 143 step 97850 | 438 batches | lr 0.00043 | ms/batch 5131.09 | loss 0.77 | bpc 1.10896 | |
| epoch 143 step 97860 | 448 batches | lr 0.00043 | ms/batch 5131.35 | loss 0.76 | bpc 1.10050 | |
| epoch 143 step 97870 | 458 batches | lr 0.00043 | ms/batch 5132.41 | loss 0.76 | bpc 1.09651 | |
| epoch 143 step 97880 | 468 batches | lr 0.00043 | ms/batch 5130.74 | loss 0.72 | bpc 1.04289 | |
| epoch 143 step 97890 | 478 batches | lr 0.00043 | ms/batch 5131.50 | loss 0.76 | bpc 1.09156 | |
| epoch 143 step 97900 | 488 batches | lr 0.00043 | ms/batch 5132.46 | loss 0.75 | bpc 1.08192 | |
| epoch 143 step 97910 | 498 batches | lr 0.00043 | ms/batch 5128.46 | loss 0.75 | bpc 1.07493 | |
| epoch 143 step 97920 | 508 batches | lr 0.00043 | ms/batch 5131.00 | loss 0.75 | bpc 1.08292 | |
| epoch 143 step 97930 | 518 batches | lr 0.00043 | ms/batch 5131.07 | loss 0.74 | bpc 1.06791 | |
| epoch 143 step 97940 | 528 batches | lr 0.00043 | ms/batch 5131.77 | loss 0.77 | bpc 1.10556 | |
| epoch 143 step 97950 | 538 batches | lr 0.00043 | ms/batch 5131.70 | loss 0.79 | bpc 1.13427 | |
| epoch 143 step 97960 | 548 batches | lr 0.00043 | ms/batch 5113.15 | loss 0.73 | bpc 1.05715 | |
| epoch 143 step 97970 | 558 batches | lr 0.00043 | ms/batch 5109.59 | loss 0.76 | bpc 1.10056 | |
| epoch 143 step 97980 | 568 batches | lr 0.00043 | ms/batch 5109.00 | loss 0.77 | bpc 1.11576 | |
| epoch 143 step 97990 | 578 batches | lr 0.00043 | ms/batch 5110.89 | loss 0.73 | bpc 1.05526 | |
| epoch 143 step 98000 | 588 batches | lr 0.00043 | ms/batch 5120.63 | loss 0.75 | bpc 1.08391 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 392 at step 98000 | time: 1349.97s | valid loss 0.74 | bpc 1.07386 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 143 step 98010 | 598 batches | lr 0.00043 | ms/batch 11947.65 | loss 0.76 | bpc 1.09492 | |
| epoch 143 step 98020 | 608 batches | lr 0.00043 | ms/batch 5128.24 | loss 0.74 | bpc 1.06414 | |
| epoch 143 step 98030 | 618 batches | lr 0.000429 | ms/batch 5125.80 | loss 0.75 | bpc 1.07529 | |
| epoch 143 step 98040 | 628 batches | lr 0.000429 | ms/batch 5132.44 | loss 0.76 | bpc 1.10253 | |
| epoch 143 step 98050 | 638 batches | lr 0.000429 | ms/batch 5132.28 | loss 0.74 | bpc 1.06609 | |
| epoch 143 step 98060 | 648 batches | lr 0.000429 | ms/batch 5132.64 | loss 0.79 | bpc 1.14153 | |
| epoch 143 step 98070 | 658 batches | lr 0.000429 | ms/batch 5131.94 | loss 0.77 | bpc 1.11401 | |
| epoch 143 step 98080 | 668 batches | lr 0.000429 | ms/batch 5129.08 | loss 0.77 | bpc 1.10964 | |
| epoch 143 step 98090 | 678 batches | lr 0.000429 | ms/batch 5133.00 | loss 0.70 | bpc 1.00344 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
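
# Note (added): the "dropping batch" lines recur at every epoch boundary
# (the ~[10575, 2] dataset tail, once per worker) and before every eval
# (the [2407, 2] validation tail). A hypothetical sketch of the check that
# produces them -- names and shapes here are illustrative, not the actual
# training code:
#
#   import torch
#
#   def full_length_batches(data: torch.Tensor, seq_len: int = 16384):
#       # data: [total_len, batch_size], e.g. batch_size=2 as in this log
#       for start in range(0, data.size(0), seq_len):
#           batch = data[start:start + seq_len]
#           if batch.size(0) < seq_len:
#               print(f"dropping batch of size {batch.shape} "
#                     f"because not full seq_len of {seq_len}")
#               continue
#           yield batch
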
| epoch 144 step 98100 | 2 batches | lr 0.000429 | ms/batch 5130.95 | loss 0.69 | bpc 1.00181 | |
| epoch 144 step 98110 | 12 batches | lr 0.000429 | ms/batch 5133.38 | loss 0.74 | bpc 1.06797 | |
| epoch 144 step 98120 | 22 batches | lr 0.000429 | ms/batch 5131.27 | loss 0.77 | bpc 1.10700 | |
| epoch 144 step 98130 | 32 batches | lr 0.000429 | ms/batch 5133.07 | loss 0.73 | bpc 1.05607 | |
| epoch 144 step 98140 | 42 batches | lr 0.000429 | ms/batch 5134.27 | loss 0.72 | bpc 1.04003 | |
| epoch 144 step 98150 | 52 batches | lr 0.000429 | ms/batch 5132.89 | loss 0.76 | bpc 1.08965 | |
| epoch 144 step 98160 | 62 batches | lr 0.000429 | ms/batch 5134.15 | loss 0.75 | bpc 1.08540 | |
| epoch 144 step 98170 | 72 batches | lr 0.000429 | ms/batch 5133.92 | loss 0.75 | bpc 1.08664 | |
| epoch 144 step 98180 | 82 batches | lr 0.000429 | ms/batch 5133.00 | loss 0.70 | bpc 1.01277 | |
| epoch 144 step 98190 | 92 batches | lr 0.000429 | ms/batch 5132.57 | loss 0.72 | bpc 1.03866 | |
| epoch 144 step 98200 | 102 batches | lr 0.000429 | ms/batch 5127.51 | loss 0.73 | bpc 1.05558 | |
| epoch 144 step 98210 | 112 batches | lr 0.000429 | ms/batch 5129.28 | loss 0.75 | bpc 1.07966 | |
| epoch 144 step 98220 | 122 batches | lr 0.000429 | ms/batch 5129.34 | loss 0.75 | bpc 1.07959 | |
| epoch 144 step 98230 | 132 batches | lr 0.000429 | ms/batch 5130.99 | loss 0.76 | bpc 1.09725 | |
| epoch 144 step 98240 | 142 batches | lr 0.000429 | ms/batch 5131.55 | loss 0.80 | bpc 1.15291 | |
| epoch 144 step 98250 | 152 batches | lr 0.000429 | ms/batch 5131.13 | loss 0.76 | bpc 1.09529 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 393 at step 98250 | time: 1352.76s | valid loss 0.74 | bpc 1.06719 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 144 step 98260 | 162 batches | lr 0.000429 | ms/batch 12446.42 | loss 0.78 | bpc 1.12748 | |
| epoch 144 step 98270 | 172 batches | lr 0.000429 | ms/batch 5131.24 | loss 0.70 | bpc 1.01142 | |
| epoch 144 step 98280 | 182 batches | lr 0.000429 | ms/batch 5133.21 | loss 0.68 | bpc 0.97597 | |
| epoch 144 step 98290 | 192 batches | lr 0.000429 | ms/batch 5131.91 | loss 0.68 | bpc 0.98759 | |
| epoch 144 step 98300 | 202 batches | lr 0.000429 | ms/batch 5132.77 | loss 0.61 | bpc 0.87885 | |
| epoch 144 step 98310 | 212 batches | lr 0.000429 | ms/batch 5133.70 | loss 0.73 | bpc 1.04924 | |
| epoch 144 step 98320 | 222 batches | lr 0.000429 | ms/batch 5130.37 | loss 0.76 | bpc 1.09534 | |
| epoch 144 step 98330 | 232 batches | lr 0.000429 | ms/batch 5132.10 | loss 0.74 | bpc 1.06598 | |
| epoch 144 step 98340 | 242 batches | lr 0.000429 | ms/batch 5132.31 | loss 0.74 | bpc 1.06929 | |
| epoch 144 step 98350 | 252 batches | lr 0.000429 | ms/batch 5134.31 | loss 0.79 | bpc 1.13957 | |
| epoch 144 step 98360 | 262 batches | lr 0.000429 | ms/batch 5124.13 | loss 0.71 | bpc 1.02977 | |
| epoch 144 step 98370 | 272 batches | lr 0.000429 | ms/batch 5127.88 | loss 0.70 | bpc 1.01560 | |
| epoch 144 step 98380 | 282 batches | lr 0.000429 | ms/batch 5133.81 | loss 0.76 | bpc 1.10152 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
| epoch 144 step 98390 | 292 batches | lr 0.000429 | ms/batch 5131.14 | loss 0.76 | bpc 1.10062 | |
| epoch 144 step 98400 | 302 batches | lr 0.000429 | ms/batch 5130.83 | loss 0.75 | bpc 1.08911 | |
| epoch 144 step 98410 | 312 batches | lr 0.000429 | ms/batch 5132.08 | loss 0.78 | bpc 1.11874 | |
| epoch 144 step 98420 | 322 batches | lr 0.000429 | ms/batch 5134.07 | loss 0.73 | bpc 1.05887 | |
| epoch 144 step 98430 | 332 batches | lr 0.000429 | ms/batch 5132.01 | loss 0.77 | bpc 1.10598 | |
| epoch 144 step 98440 | 342 batches | lr 0.000429 | ms/batch 5134.58 | loss 0.70 | bpc 1.00552 | |
| epoch 144 step 98450 | 352 batches | lr 0.000429 | ms/batch 5132.27 | loss 0.76 | bpc 1.10202 | |
| epoch 144 step 98460 | 362 batches | lr 0.000429 | ms/batch 5132.35 | loss 0.77 | bpc 1.10740 | |
| epoch 144 step 98470 | 372 batches | lr 0.000429 | ms/batch 5131.68 | loss 0.72 | bpc 1.03829 | |
| epoch 144 step 98480 | 382 batches | lr 0.000429 | ms/batch 5128.85 | loss 0.76 | bpc 1.09289 | |
| epoch 144 step 98490 | 392 batches | lr 0.000429 | ms/batch 5129.30 | loss 0.75 | bpc 1.08441 | |
| epoch 144 step 98500 | 402 batches | lr 0.000429 | ms/batch 5110.59 | loss 0.79 | bpc 1.13765 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 394 at step 98500 | time: 1350.93s | valid loss 0.74 | bpc 1.07165 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 144 step 98510 | 412 batches | lr 0.000429 | ms/batch 11923.55 | loss 0.75 | bpc 1.07535 | |
| epoch 144 step 98520 | 422 batches | lr 0.000429 | ms/batch 5111.37 | loss 0.73 | bpc 1.05598 | |
| epoch 144 step 98530 | 432 batches | lr 0.000429 | ms/batch 5121.47 | loss 0.74 | bpc 1.06764 | |
| epoch 144 step 98540 | 442 batches | lr 0.000429 | ms/batch 5125.38 | loss 0.77 | bpc 1.11638 | |
| epoch 144 step 98550 | 452 batches | lr 0.000429 | ms/batch 5134.20 | loss 0.76 | bpc 1.09369 | |
| epoch 144 step 98560 | 462 batches | lr 0.000429 | ms/batch 5135.00 | loss 0.75 | bpc 1.08767 | |
| epoch 144 step 98570 | 472 batches | lr 0.000429 | ms/batch 5133.05 | loss 0.74 | bpc 1.06612 | |
| epoch 144 step 98580 | 482 batches | lr 0.000429 | ms/batch 5133.61 | loss 0.76 | bpc 1.09170 | |
| epoch 144 step 98590 | 492 batches | lr 0.000429 | ms/batch 5132.08 | loss 0.74 | bpc 1.06924 | |
| epoch 144 step 98600 | 502 batches | lr 0.000429 | ms/batch 5133.55 | loss 0.74 | bpc 1.06381 | |
| epoch 144 step 98610 | 512 batches | lr 0.000429 | ms/batch 5132.62 | loss 0.74 | bpc 1.06926 | |
| epoch 144 step 98620 | 522 batches | lr 0.000429 | ms/batch 5133.73 | loss 0.75 | bpc 1.08356 | |
| epoch 144 step 98630 | 532 batches | lr 0.000429 | ms/batch 5133.86 | loss 0.79 | bpc 1.13988 | |
| epoch 144 step 98640 | 542 batches | lr 0.000429 | ms/batch 5132.66 | loss 0.75 | bpc 1.07837 | |
| epoch 144 step 98650 | 552 batches | lr 0.000429 | ms/batch 5134.59 | loss 0.76 | bpc 1.08933 | |
| epoch 144 step 98660 | 562 batches | lr 0.000429 | ms/batch 5134.42 | loss 0.77 | bpc 1.11743 | |
| epoch 144 step 98670 | 572 batches | lr 0.000429 | ms/batch 5131.12 | loss 0.75 | bpc 1.07767 | |
| epoch 144 step 98680 | 582 batches | lr 0.000429 | ms/batch 5132.01 | loss 0.73 | bpc 1.05120 | |
| epoch 144 step 98690 | 592 batches | lr 0.000429 | ms/batch 5131.73 | loss 0.77 | bpc 1.11512 | |
| epoch 144 step 98700 | 602 batches | lr 0.000429 | ms/batch 5134.20 | loss 0.74 | bpc 1.06657 | |
| epoch 144 step 98710 | 612 batches | lr 0.000429 | ms/batch 5128.71 | loss 0.72 | bpc 1.04075 | |
| epoch 144 step 98720 | 622 batches | lr 0.000429 | ms/batch 5130.78 | loss 0.77 | bpc 1.10464 | |
| epoch 144 step 98730 | 632 batches | lr 0.000429 | ms/batch 5132.08 | loss 0.76 | bpc 1.09809 | |
| epoch 144 step 98740 | 642 batches | lr 0.000429 | ms/batch 5134.77 | loss 0.75 | bpc 1.08726 | |
| epoch 144 step 98750 | 652 batches | lr 0.000429 | ms/batch 5132.02 | loss 0.79 | bpc 1.13441 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 395 at step 98750 | time: 1350.82s | valid loss 0.74 | bpc 1.07150 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 144 step 98760 | 662 batches | lr 0.000428 | ms/batch 11947.26 | loss 0.78 | bpc 1.12179 | |
| epoch 144 step 98770 | 672 batches | lr 0.000428 | ms/batch 5133.88 | loss 0.72 | bpc 1.04098 | |
| epoch 144 step 98780 | 682 batches | lr 0.000428 | ms/batch 5131.29 | loss 0.71 | bpc 1.01959 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 145 step 98790 | 6 batches | lr 0.000428 | ms/batch 5134.40 | loss 0.71 | bpc 1.02109 | |
| epoch 145 step 98800 | 16 batches | lr 0.000428 | ms/batch 5133.59 | loss 0.74 | bpc 1.07192 | |
| epoch 145 step 98810 | 26 batches | lr 0.000428 | ms/batch 5133.88 | loss 0.78 | bpc 1.12025 | |
| epoch 145 step 98820 | 36 batches | lr 0.000428 | ms/batch 5135.35 | loss 0.71 | bpc 1.01810 | |
| epoch 145 step 98830 | 46 batches | lr 0.000428 | ms/batch 5131.09 | loss 0.74 | bpc 1.06045 | |
| epoch 145 step 98840 | 56 batches | lr 0.000428 | ms/batch 5133.38 | loss 0.77 | bpc 1.11336 | |
| epoch 145 step 98850 | 66 batches | lr 0.000428 | ms/batch 5133.21 | loss 0.76 | bpc 1.09361 | |
| epoch 145 step 98860 | 76 batches | lr 0.000428 | ms/batch 5132.11 | loss 0.72 | bpc 1.03301 | |
| epoch 145 step 98870 | 86 batches | lr 0.000428 | ms/batch 5130.95 | loss 0.72 | bpc 1.03271 | |
| epoch 145 step 98880 | 96 batches | lr 0.000428 | ms/batch 5132.69 | loss 0.72 | bpc 1.03224 | |
| epoch 145 step 98890 | 106 batches | lr 0.000428 | ms/batch 5133.25 | loss 0.75 | bpc 1.08383 | |
| epoch 145 step 98900 | 116 batches | lr 0.000428 | ms/batch 5131.85 | loss 0.74 | bpc 1.07077 | |
| epoch 145 step 98910 | 126 batches | lr 0.000428 | ms/batch 5130.76 | loss 0.76 | bpc 1.09058 | |
| epoch 145 step 98920 | 136 batches | lr 0.000428 | ms/batch 5130.99 | loss 0.77 | bpc 1.11378 | |
| epoch 145 step 98930 | 146 batches | lr 0.000428 | ms/batch 5128.79 | loss 0.77 | bpc 1.10790 | |
| epoch 145 step 98940 | 156 batches | lr 0.000428 | ms/batch 5132.01 | loss 0.77 | bpc 1.10896 | |
| epoch 145 step 98950 | 166 batches | lr 0.000428 | ms/batch 5132.48 | loss 0.78 | bpc 1.11872 | |
| epoch 145 step 98960 | 176 batches | lr 0.000428 | ms/batch 5132.16 | loss 0.68 | bpc 0.97701 | |
| epoch 145 step 98970 | 186 batches | lr 0.000428 | ms/batch 5134.49 | loss 0.68 | bpc 0.98286 | |
| epoch 145 step 98980 | 196 batches | lr 0.000428 | ms/batch 5132.46 | loss 0.66 | bpc 0.94919 | |
| epoch 145 step 98990 | 206 batches | lr 0.000428 | ms/batch 5132.84 | loss 0.65 | bpc 0.93711 | |
| epoch 145 step 99000 | 216 batches | lr 0.000428 | ms/batch 5131.20 | loss 0.75 | bpc 1.08277 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 396 at step 99000 | time: 1359.99s | valid loss 0.74 | bpc 1.06702 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 145 step 99010 | 226 batches | lr 0.000428 | ms/batch 12480.45 | loss 0.74 | bpc 1.06392 | |
| epoch 145 step 99020 | 236 batches | lr 0.000428 | ms/batch 5132.56 | loss 0.73 | bpc 1.05754 | |
| epoch 145 step 99030 | 246 batches | lr 0.000428 | ms/batch 5127.79 | loss 0.78 | bpc 1.11931 | |
| epoch 145 step 99040 | 256 batches | lr 0.000428 | ms/batch 5135.40 | loss 0.75 | bpc 1.07578 | |
| epoch 145 step 99050 | 266 batches | lr 0.000428 | ms/batch 5132.93 | loss 0.70 | bpc 1.00980 | |
| epoch 145 step 99060 | 276 batches | lr 0.000428 | ms/batch 5134.80 | loss 0.74 | bpc 1.06543 | |
| epoch 145 step 99070 | 286 batches | lr 0.000428 | ms/batch 5134.62 | loss 0.78 | bpc 1.13120 | |
| epoch 145 step 99080 | 296 batches | lr 0.000428 | ms/batch 5114.43 | loss 0.74 | bpc 1.07015 | |
| epoch 145 step 99090 | 306 batches | lr 0.000428 | ms/batch 5111.86 | loss 0.75 | bpc 1.08047 | |
| epoch 145 step 99100 | 316 batches | lr 0.000428 | ms/batch 5115.50 | loss 0.77 | bpc 1.11436 | |
| epoch 145 step 99110 | 326 batches | lr 0.000428 | ms/batch 5111.48 | loss 0.75 | bpc 1.07498 | |
| epoch 145 step 99120 | 336 batches | lr 0.000428 | ms/batch 5110.55 | loss 0.75 | bpc 1.07500 | |
| epoch 145 step 99130 | 346 batches | lr 0.000428 | ms/batch 5125.55 | loss 0.72 | bpc 1.03481 | |
| epoch 145 step 99140 | 356 batches | lr 0.000428 | ms/batch 5136.73 | loss 0.78 | bpc 1.12544 | |
| epoch 145 step 99150 | 366 batches | lr 0.000428 | ms/batch 5133.65 | loss 0.74 | bpc 1.06368 | |
| epoch 145 step 99160 | 376 batches | lr 0.000428 | ms/batch 5131.94 | loss 0.73 | bpc 1.05290 | |
| epoch 145 step 99170 | 386 batches | lr 0.000428 | ms/batch 5134.06 | loss 0.77 | bpc 1.10514 | |
| epoch 145 step 99180 | 396 batches | lr 0.000428 | ms/batch 5132.04 | loss 0.76 | bpc 1.09312 | |
| epoch 145 step 99190 | 406 batches | lr 0.000428 | ms/batch 5133.55 | loss 0.76 | bpc 1.10321 | |
| epoch 145 step 99200 | 416 batches | lr 0.000428 | ms/batch 5131.40 | loss 0.75 | bpc 1.07884 | |
| epoch 145 step 99210 | 426 batches | lr 0.000428 | ms/batch 5129.94 | loss 0.71 | bpc 1.02966 | |
| epoch 145 step 99220 | 436 batches | lr 0.000428 | ms/batch 5130.14 | loss 0.78 | bpc 1.12073 | |
| epoch 145 step 99230 | 446 batches | lr 0.000428 | ms/batch 5135.62 | loss 0.76 | bpc 1.09138 | |
| epoch 145 step 99240 | 456 batches | lr 0.000428 | ms/batch 5130.62 | loss 0.75 | bpc 1.08856 | |
| epoch 145 step 99250 | 466 batches | lr 0.000428 | ms/batch 5133.87 | loss 0.74 | bpc 1.07163 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 397 at step 99250 | time: 1365.73s | valid loss 0.75 | bpc 1.07557 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 145 step 99260 | 476 batches | lr 0.000428 | ms/batch 11947.37 | loss 0.74 | bpc 1.06465 | |
| epoch 145 step 99270 | 486 batches | lr 0.000428 | ms/batch 5134.08 | loss 0.76 | bpc 1.09069 | |
| epoch 145 step 99280 | 496 batches | lr 0.000428 | ms/batch 5136.64 | loss 0.75 | bpc 1.08220 | |
| epoch 145 step 99290 | 506 batches | lr 0.000428 | ms/batch 5132.63 | loss 0.73 | bpc 1.05702 | |
| epoch 145 step 99300 | 516 batches | lr 0.000428 | ms/batch 5131.95 | loss 0.75 | bpc 1.07947 | |
| epoch 145 step 99310 | 526 batches | lr 0.000428 | ms/batch 5132.71 | loss 0.76 | bpc 1.09374 | |
| epoch 145 step 99320 | 536 batches | lr 0.000428 | ms/batch 5132.41 | loss 0.79 | bpc 1.13437 | |
| epoch 145 step 99330 | 546 batches | lr 0.000428 | ms/batch 5132.46 | loss 0.74 | bpc 1.06502 | |
| epoch 145 step 99340 | 556 batches | lr 0.000428 | ms/batch 5131.54 | loss 0.76 | bpc 1.10366 | |
| epoch 145 step 99350 | 566 batches | lr 0.000428 | ms/batch 5134.02 | loss 0.78 | bpc 1.11852 | |
| epoch 145 step 99360 | 576 batches | lr 0.000428 | ms/batch 5134.50 | loss 0.73 | bpc 1.05903 | |
| epoch 145 step 99370 | 586 batches | lr 0.000428 | ms/batch 5133.40 | loss 0.74 | bpc 1.06825 | |
| epoch 145 step 99380 | 596 batches | lr 0.000428 | ms/batch 5131.67 | loss 0.77 | bpc 1.10604 | |
| epoch 145 step 99390 | 606 batches | lr 0.000428 | ms/batch 5133.88 | loss 0.74 | bpc 1.06177 | |
| epoch 145 step 99400 | 616 batches | lr 0.000428 | ms/batch 5134.74 | loss 0.73 | bpc 1.05827 | |
| epoch 145 step 99410 | 626 batches | lr 0.000428 | ms/batch 5131.72 | loss 0.75 | bpc 1.08903 | |
| epoch 145 step 99420 | 636 batches | lr 0.000428 | ms/batch 5132.01 | loss 0.77 | bpc 1.10445 | |
| epoch 145 step 99430 | 646 batches | lr 0.000428 | ms/batch 5134.56 | loss 0.76 | bpc 1.09423 | |
| epoch 145 step 99440 | 656 batches | lr 0.000428 | ms/batch 5134.64 | loss 0.79 | bpc 1.13376 | |
| epoch 145 step 99450 | 666 batches | lr 0.000428 | ms/batch 5132.58 | loss 0.78 | bpc 1.12000 | |
| epoch 145 step 99460 | 676 batches | lr 0.000428 | ms/batch 5133.97 | loss 0.70 | bpc 1.00711 | |
| epoch 145 step 99470 | 686 batches | lr 0.000428 | ms/batch 5134.62 | loss 0.69 | bpc 0.99857 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 146 step 99480 | 10 batches | lr 0.000427 | ms/batch 5133.39 | loss 0.73 | bpc 1.04966 | |
| epoch 146 step 99490 | 20 batches | lr 0.000427 | ms/batch 5131.46 | loss 0.76 | bpc 1.09433 | |
| epoch 146 step 99500 | 30 batches | lr 0.000427 | ms/batch 5134.10 | loss 0.75 | bpc 1.08155 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 398 at step 99500 | time: 1355.92s | valid loss 0.74 | bpc 1.07221 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 146 step 99510 | 40 batches | lr 0.000427 | ms/batch 11948.10 | loss 0.71 | bpc 1.02970 | |
| epoch 146 step 99520 | 50 batches | lr 0.000427 | ms/batch 5133.57 | loss 0.75 | bpc 1.08887 | |
| epoch 146 step 99530 | 60 batches | lr 0.000427 | ms/batch 5134.67 | loss 0.77 | bpc 1.10436 | |
| epoch 146 step 99540 | 70 batches | lr 0.000427 | ms/batch 5130.37 | loss 0.74 | bpc 1.06984 | |
| epoch 146 step 99550 | 80 batches | lr 0.000427 | ms/batch 5134.16 | loss 0.70 | bpc 1.01636 | |
| epoch 146 step 99560 | 90 batches | lr 0.000427 | ms/batch 5133.49 | loss 0.73 | bpc 1.05685 | |
| epoch 146 step 99570 | 100 batches | lr 0.000427 | ms/batch 5132.06 | loss 0.71 | bpc 1.01796 | |
| epoch 146 step 99580 | 110 batches | lr 0.000427 | ms/batch 5125.11 | loss 0.76 | bpc 1.10251 | |
| epoch 146 step 99590 | 120 batches | lr 0.000427 | ms/batch 5132.78 | loss 0.73 | bpc 1.05633 | |
| epoch 146 step 99600 | 130 batches | lr 0.000427 | ms/batch 5135.89 | loss 0.78 | bpc 1.11967 | |
| epoch 146 step 99610 | 140 batches | lr 0.000427 | ms/batch 5132.50 | loss 0.79 | bpc 1.14019 | |
| epoch 146 step 99620 | 150 batches | lr 0.000427 | ms/batch 5130.70 | loss 0.75 | bpc 1.07989 | |
| epoch 146 step 99630 | 160 batches | lr 0.000427 | ms/batch 5132.27 | loss 0.78 | bpc 1.13139 | |
| epoch 146 step 99640 | 170 batches | lr 0.000427 | ms/batch 5131.10 | loss 0.73 | bpc 1.05319 | |
| epoch 146 step 99650 | 180 batches | lr 0.000427 | ms/batch 5132.63 | loss 0.67 | bpc 0.97087 | |
| epoch 146 step 99660 | 190 batches | lr 0.000427 | ms/batch 5132.36 | loss 0.67 | bpc 0.97052 | |
| epoch 146 step 99670 | 200 batches | lr 0.000427 | ms/batch 5136.83 | loss 0.63 | bpc 0.91016 | |
| epoch 146 step 99680 | 210 batches | lr 0.000427 | ms/batch 5130.74 | loss 0.70 | bpc 1.00882 | |
| epoch 146 step 99690 | 220 batches | lr 0.000427 | ms/batch 5133.01 | loss 0.77 | bpc 1.10412 | |
| epoch 146 step 99700 | 230 batches | lr 0.000427 | ms/batch 5131.87 | loss 0.74 | bpc 1.06412 | |
| epoch 146 step 99710 | 240 batches | lr 0.000427 | ms/batch 5131.69 | loss 0.74 | bpc 1.06468 | |
| epoch 146 step 99720 | 250 batches | lr 0.000427 | ms/batch 5110.58 | loss 0.78 | bpc 1.12695 | |
| epoch 146 step 99730 | 260 batches | lr 0.000427 | ms/batch 5112.41 | loss 0.72 | bpc 1.04057 | |
| epoch 146 step 99740 | 270 batches | lr 0.000427 | ms/batch 5111.76 | loss 0.71 | bpc 1.01831 | |
| epoch 146 step 99750 | 280 batches | lr 0.000427 | ms/batch 5112.48 | loss 0.75 | bpc 1.08026 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 399 at step 99750 | time: 1350.44s | valid loss 0.74 | bpc 1.07086 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 146 step 99760 | 290 batches | lr 0.000427 | ms/batch 11936.34 | loss 0.77 | bpc 1.11779 | |
| epoch 146 step 99770 | 300 batches | lr 0.000427 | ms/batch 5138.48 | loss 0.75 | bpc 1.08153 | |
| epoch 146 step 99780 | 310 batches | lr 0.000427 | ms/batch 5133.68 | loss 0.77 | bpc 1.11038 | |
| epoch 146 step 99790 | 320 batches | lr 0.000427 | ms/batch 5135.88 | loss 0.73 | bpc 1.05695 | |
| epoch 146 step 99800 | 330 batches | lr 0.000427 | ms/batch 5131.38 | loss 0.76 | bpc 1.09577 | |
| epoch 146 step 99810 | 340 batches | lr 0.000427 | ms/batch 5130.32 | loss 0.72 | bpc 1.04091 | |
| epoch 146 step 99820 | 350 batches | lr 0.000427 | ms/batch 5137.12 | loss 0.74 | bpc 1.06843 | |
| epoch 146 step 99830 | 360 batches | lr 0.000427 | ms/batch 5131.24 | loss 0.79 | bpc 1.13555 | |
| epoch 146 step 99840 | 370 batches | lr 0.000427 | ms/batch 5132.55 | loss 0.73 | bpc 1.05109 | |
| epoch 146 step 99850 | 380 batches | lr 0.000427 | ms/batch 5134.16 | loss 0.75 | bpc 1.07578 | |
| epoch 146 step 99860 | 390 batches | lr 0.000427 | ms/batch 5131.61 | loss 0.74 | bpc 1.07312 | |
| epoch 146 step 99870 | 400 batches | lr 0.000427 | ms/batch 5131.99 | loss 0.79 | bpc 1.13420 | |
| epoch 146 step 99880 | 410 batches | lr 0.000427 | ms/batch 5134.14 | loss 0.76 | bpc 1.09191 | |
| epoch 146 step 99890 | 420 batches | lr 0.000427 | ms/batch 5134.22 | loss 0.73 | bpc 1.04993 | |
| epoch 146 step 99900 | 430 batches | lr 0.000427 | ms/batch 5130.97 | loss 0.73 | bpc 1.05848 | |
| epoch 146 step 99910 | 440 batches | lr 0.000427 | ms/batch 5132.08 | loss 0.77 | bpc 1.10648 | |
| epoch 146 step 99920 | 450 batches | lr 0.000427 | ms/batch 5132.87 | loss 0.76 | bpc 1.09308 | |
| epoch 146 step 99930 | 460 batches | lr 0.000427 | ms/batch 5132.49 | loss 0.76 | bpc 1.08927 | |
| epoch 146 step 99940 | 470 batches | lr 0.000427 | ms/batch 5130.56 | loss 0.73 | bpc 1.05161 | |
| epoch 146 step 99950 | 480 batches | lr 0.000427 | ms/batch 5129.88 | loss 0.76 | bpc 1.10015 | |
| epoch 146 step 99960 | 490 batches | lr 0.000427 | ms/batch 5132.43 | loss 0.74 | bpc 1.06973 | |
| epoch 146 step 99970 | 500 batches | lr 0.000427 | ms/batch 5132.82 | loss 0.74 | bpc 1.06414 | |
| epoch 146 step 99980 | 510 batches | lr 0.000427 | ms/batch 5131.34 | loss 0.74 | bpc 1.07050 | |
| epoch 146 step 99990 | 520 batches | lr 0.000427 | ms/batch 5133.53 | loss 0.75 | bpc 1.08710 | |
| epoch 146 step 100000 | 530 batches | lr 0.000427 | ms/batch 5131.35 | loss 0.77 | bpc 1.10833 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 400 at step 100000 | time: 1351.32s | valid loss 0.74 | bpc 1.07186 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 146 step 100010 | 540 batches | lr 0.000427 | ms/batch 11944.90 | loss 0.78 | bpc 1.12299 | |
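
# Note (added): the first 10-step window after every eval logs ~11,900-12,500
# ms/batch against a ~5,130 ms/batch steady state. A plausible reading
# (assumption: the interval timer is not reset after validation) is that
# eval/checkpoint overhead lands in that window; values below are taken from
# the lines around Eval 400:
#
#   eval_overhead_s = (11944.90 - 5131.35) * 10 / 1000  # ms/batch gap x 10 steps
#   print(f"~{eval_overhead_s:.0f} s per 250-step cycle")  # ~68 s
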
| epoch 146 step 100020 | 550 batches | lr 0.000427 | ms/batch 5133.93 | loss 0.74 | bpc 1.06082 | |
| epoch 146 step 100030 | 560 batches | lr 0.000427 | ms/batch 5137.92 | loss 0.77 | bpc 1.10735 | |
| epoch 146 step 100040 | 570 batches | lr 0.000427 | ms/batch 5134.63 | loss 0.76 | bpc 1.09436 | |
| epoch 146 step 100050 | 580 batches | lr 0.000427 | ms/batch 5133.04 | loss 0.74 | bpc 1.06308 | |
| epoch 146 step 100060 | 590 batches | lr 0.000427 | ms/batch 5133.35 | loss 0.75 | bpc 1.07982 | |
| epoch 146 step 100070 | 600 batches | lr 0.000427 | ms/batch 5131.73 | loss 0.75 | bpc 1.08444 | |
| epoch 146 step 100080 | 610 batches | lr 0.000427 | ms/batch 5133.99 | loss 0.73 | bpc 1.05291 | |
| epoch 146 step 100090 | 620 batches | lr 0.000427 | ms/batch 5131.55 | loss 0.75 | bpc 1.08698 | |
| epoch 146 step 100100 | 630 batches | lr 0.000427 | ms/batch 5132.86 | loss 0.77 | bpc 1.10407 | |
| epoch 146 step 100110 | 640 batches | lr 0.000427 | ms/batch 5132.31 | loss 0.75 | bpc 1.08010 | |
| epoch 146 step 100120 | 650 batches | lr 0.000427 | ms/batch 5130.42 | loss 0.79 | bpc 1.13799 | |
| epoch 146 step 100130 | 660 batches | lr 0.000427 | ms/batch 5127.07 | loss 0.77 | bpc 1.11329 | |
| epoch 146 step 100140 | 670 batches | lr 0.000427 | ms/batch 5132.09 | loss 0.75 | bpc 1.07931 | |
| epoch 146 step 100150 | 680 batches | lr 0.000427 | ms/batch 5130.90 | loss 0.70 | bpc 1.00788 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 147 step 100160 | 4 batches | lr 0.000427 | ms/batch 5134.22 | loss 0.70 | bpc 1.00335 | |
| epoch 147 step 100170 | 14 batches | lr 0.000427 | ms/batch 5131.23 | loss 0.74 | bpc 1.06508 | |
| epoch 147 step 100180 | 24 batches | lr 0.000427 | ms/batch 5133.92 | loss 0.77 | bpc 1.10925 | |
| epoch 147 step 100190 | 34 batches | lr 0.000427 | ms/batch 5135.53 | loss 0.73 | bpc 1.04629 | |
| epoch 147 step 100200 | 44 batches | lr 0.000426 | ms/batch 5133.21 | loss 0.72 | bpc 1.04100 | |
| epoch 147 step 100210 | 54 batches | lr 0.000426 | ms/batch 5132.76 | loss 0.77 | bpc 1.10824 | |
| epoch 147 step 100220 | 64 batches | lr 0.000426 | ms/batch 5131.88 | loss 0.75 | bpc 1.08258 | |
| epoch 147 step 100230 | 74 batches | lr 0.000426 | ms/batch 5131.40 | loss 0.74 | bpc 1.06453 | |
| epoch 147 step 100240 | 84 batches | lr 0.000426 | ms/batch 5132.88 | loss 0.70 | bpc 1.00981 | |
| epoch 147 step 100250 | 94 batches | lr 0.000426 | ms/batch 5131.63 | loss 0.72 | bpc 1.03808 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 401 at step 100250 | time: 1357.19s | valid loss 0.74 | bpc 1.07105 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 147 step 100260 | 104 batches | lr 0.000426 | ms/batch 11947.25 | loss 0.74 | bpc 1.07026 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
| epoch 147 step 100270 | 114 batches | lr 0.000426 | ms/batch 5130.17 | loss 0.74 | bpc 1.07189 | |
| epoch 147 step 100280 | 124 batches | lr 0.000426 | ms/batch 5131.76 | loss 0.76 | bpc 1.09304 | |
| epoch 147 step 100290 | 134 batches | lr 0.000426 | ms/batch 5131.54 | loss 0.76 | bpc 1.09474 | |
| epoch 147 step 100300 | 144 batches | lr 0.000426 | ms/batch 5129.89 | loss 0.78 | bpc 1.12835 | |
| epoch 147 step 100310 | 154 batches | lr 0.000426 | ms/batch 5134.90 | loss 0.76 | bpc 1.09785 | |
| epoch 147 step 100320 | 164 batches | lr 0.000426 | ms/batch 5136.26 | loss 0.78 | bpc 1.13203 | |
| epoch 147 step 100330 | 174 batches | lr 0.000426 | ms/batch 5133.03 | loss 0.68 | bpc 0.97983 | |
| epoch 147 step 100340 | 184 batches | lr 0.000426 | ms/batch 5132.35 | loss 0.68 | bpc 0.98159 | |
| epoch 147 step 100350 | 194 batches | lr 0.000426 | ms/batch 5132.35 | loss 0.69 | bpc 0.99133 | |
| epoch 147 step 100360 | 204 batches | lr 0.000426 | ms/batch 5131.34 | loss 0.62 | bpc 0.89225 | |
| epoch 147 step 100370 | 214 batches | lr 0.000426 | ms/batch 5134.89 | loss 0.73 | bpc 1.05025 | |
| epoch 147 step 100380 | 224 batches | lr 0.000426 | ms/batch 5134.31 | loss 0.74 | bpc 1.07277 | |
| epoch 147 step 100390 | 234 batches | lr 0.000426 | ms/batch 5122.10 | loss 0.74 | bpc 1.06237 | |
| epoch 147 step 100400 | 244 batches | lr 0.000426 | ms/batch 5111.10 | loss 0.75 | bpc 1.08693 | |
| epoch 147 step 100410 | 254 batches | lr 0.000426 | ms/batch 5113.00 | loss 0.78 | bpc 1.12171 | |
| epoch 147 step 100420 | 264 batches | lr 0.000426 | ms/batch 5111.51 | loss 0.70 | bpc 1.01041 | |
| epoch 147 step 100430 | 274 batches | lr 0.000426 | ms/batch 5114.54 | loss 0.72 | bpc 1.03929 | |
| epoch 147 step 100440 | 284 batches | lr 0.000426 | ms/batch 5112.41 | loss 0.77 | bpc 1.10618 | |
| epoch 147 step 100450 | 294 batches | lr 0.000426 | ms/batch 5113.84 | loss 0.76 | bpc 1.09254 | |
| epoch 147 step 100460 | 304 batches | lr 0.000426 | ms/batch 5138.11 | loss 0.76 | bpc 1.09071 | |
| epoch 147 step 100470 | 314 batches | lr 0.000426 | ms/batch 5133.53 | loss 0.76 | bpc 1.10274 | |
| epoch 147 step 100480 | 324 batches | lr 0.000426 | ms/batch 5133.44 | loss 0.75 | bpc 1.07684 | |
| epoch 147 step 100490 | 334 batches | lr 0.000426 | ms/batch 5135.07 | loss 0.75 | bpc 1.08891 | |
| epoch 147 step 100500 | 344 batches | lr 0.000426 | ms/batch 5135.13 | loss 0.71 | bpc 1.02029 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 402 at step 100500 | time: 1354.33s | valid loss 0.74 | bpc 1.06913 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 147 step 100510 | 354 batches | lr 0.000426 | ms/batch 11947.80 | loss 0.78 | bpc 1.12067 | |
| epoch 147 step 100520 | 364 batches | lr 0.000426 | ms/batch 5132.78 | loss 0.75 | bpc 1.07560 | |
| epoch 147 step 100530 | 374 batches | lr 0.000426 | ms/batch 5133.48 | loss 0.72 | bpc 1.04075 | |
| epoch 147 step 100540 | 384 batches | lr 0.000426 | ms/batch 5135.08 | loss 0.76 | bpc 1.09840 | |
| epoch 147 step 100550 | 394 batches | lr 0.000426 | ms/batch 5135.64 | loss 0.75 | bpc 1.08434 | |
| epoch 147 step 100560 | 404 batches | lr 0.000426 | ms/batch 5136.77 | loss 0.77 | bpc 1.10859 | |
| epoch 147 step 100570 | 414 batches | lr 0.000426 | ms/batch 5136.33 | loss 0.75 | bpc 1.08894 | |
| epoch 147 step 100580 | 424 batches | lr 0.000426 | ms/batch 5136.56 | loss 0.72 | bpc 1.03886 | |
| epoch 147 step 100590 | 434 batches | lr 0.000426 | ms/batch 5134.79 | loss 0.76 | bpc 1.09524 | |
| epoch 147 step 100600 | 444 batches | lr 0.000426 | ms/batch 5134.15 | loss 0.76 | bpc 1.09388 | |
| epoch 147 step 100610 | 454 batches | lr 0.000426 | ms/batch 5133.53 | loss 0.77 | bpc 1.10520 | |
| epoch 147 step 100620 | 464 batches | lr 0.000426 | ms/batch 5131.79 | loss 0.74 | bpc 1.07481 | |
| epoch 147 step 100630 | 474 batches | lr 0.000426 | ms/batch 5134.80 | loss 0.73 | bpc 1.05763 | |
| epoch 147 step 100640 | 484 batches | lr 0.000426 | ms/batch 5132.17 | loss 0.75 | bpc 1.08677 | |
| epoch 147 step 100650 | 494 batches | lr 0.000426 | ms/batch 5136.87 | loss 0.75 | bpc 1.08864 | |
| epoch 147 step 100660 | 504 batches | lr 0.000426 | ms/batch 5133.09 | loss 0.73 | bpc 1.05479 | |
| epoch 147 step 100670 | 514 batches | lr 0.000426 | ms/batch 5138.58 | loss 0.74 | bpc 1.06438 | |
| epoch 147 step 100680 | 524 batches | lr 0.000426 | ms/batch 5135.46 | loss 0.75 | bpc 1.08040 | |
| epoch 147 step 100690 | 534 batches | lr 0.000426 | ms/batch 5132.76 | loss 0.79 | bpc 1.14504 | |
| epoch 147 step 100700 | 544 batches | lr 0.000426 | ms/batch 5135.74 | loss 0.74 | bpc 1.07028 | |
| epoch 147 step 100710 | 554 batches | lr 0.000426 | ms/batch 5133.96 | loss 0.76 | bpc 1.09573 | |
| epoch 147 step 100720 | 564 batches | lr 0.000426 | ms/batch 5135.63 | loss 0.77 | bpc 1.11092 | |
| epoch 147 step 100730 | 574 batches | lr 0.000426 | ms/batch 5134.48 | loss 0.74 | bpc 1.06933 | |
| epoch 147 step 100740 | 584 batches | lr 0.000426 | ms/batch 5132.90 | loss 0.73 | bpc 1.05971 | |
| epoch 147 step 100750 | 594 batches | lr 0.000426 | ms/batch 5132.31 | loss 0.76 | bpc 1.10201 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 403 at step 100750 | time: 1356.26s | valid loss 0.74 | bpc 1.07222 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 147 step 100760 | 604 batches | lr 0.000426 | ms/batch 11951.81 | loss 0.75 | bpc 1.07534 | |
| epoch 147 step 100770 | 614 batches | lr 0.000426 | ms/batch 5136.51 | loss 0.72 | bpc 1.04268 | |
| epoch 147 step 100780 | 624 batches | lr 0.000426 | ms/batch 5139.92 | loss 0.76 | bpc 1.08991 | |
| epoch 147 step 100790 | 634 batches | lr 0.000426 | ms/batch 5135.86 | loss 0.77 | bpc 1.10746 | |
| epoch 147 step 100800 | 644 batches | lr 0.000426 | ms/batch 5133.34 | loss 0.75 | bpc 1.07886 | |
| epoch 147 step 100810 | 654 batches | lr 0.000426 | ms/batch 5139.22 | loss 0.80 | bpc 1.15350 | |
| epoch 147 step 100820 | 664 batches | lr 0.000426 | ms/batch 5133.20 | loss 0.77 | bpc 1.11341 | |
| epoch 147 step 100830 | 674 batches | lr 0.000426 | ms/batch 5135.64 | loss 0.71 | bpc 1.02121 | |
| epoch 147 step 100840 | 684 batches | lr 0.000426 | ms/batch 5136.21 | loss 0.70 | bpc 1.00793 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 148 step 100850 | 8 batches | lr 0.000426 | ms/batch 5137.10 | loss 0.72 | bpc 1.03628 | |
| epoch 148 step 100860 | 18 batches | lr 0.000426 | ms/batch 5131.99 | loss 0.75 | bpc 1.07765 | |
| epoch 148 step 100870 | 28 batches | lr 0.000426 | ms/batch 5129.95 | loss 0.76 | bpc 1.10232 | |
| epoch 148 step 100880 | 38 batches | lr 0.000426 | ms/batch 5130.93 | loss 0.71 | bpc 1.02995 | |
| epoch 148 step 100890 | 48 batches | lr 0.000426 | ms/batch 5136.46 | loss 0.73 | bpc 1.05832 | |
| epoch 148 step 100900 | 58 batches | lr 0.000426 | ms/batch 5133.60 | loss 0.77 | bpc 1.10821 | |
| epoch 148 step 100910 | 68 batches | lr 0.000426 | ms/batch 5132.46 | loss 0.76 | bpc 1.08935 | |
| epoch 148 step 100920 | 78 batches | lr 0.000425 | ms/batch 5132.69 | loss 0.71 | bpc 1.02175 | |
| epoch 148 step 100930 | 88 batches | lr 0.000425 | ms/batch 5138.33 | loss 0.72 | bpc 1.03857 | |
| epoch 148 step 100940 | 98 batches | lr 0.000425 | ms/batch 5134.94 | loss 0.71 | bpc 1.01942 | |
| epoch 148 step 100950 | 108 batches | lr 0.000425 | ms/batch 5135.74 | loss 0.76 | bpc 1.10065 | |
| epoch 148 step 100960 | 118 batches | lr 0.000425 | ms/batch 5135.96 | loss 0.74 | bpc 1.06654 | |
| epoch 148 step 100970 | 128 batches | lr 0.000425 | ms/batch 5135.47 | loss 0.77 | bpc 1.11065 | |
| epoch 148 step 100980 | 138 batches | lr 0.000425 | ms/batch 5135.56 | loss 0.78 | bpc 1.12438 | |
| epoch 148 step 100990 | 148 batches | lr 0.000425 | ms/batch 5136.93 | loss 0.74 | bpc 1.07237 | |
| epoch 148 step 101000 | 158 batches | lr 0.000425 | ms/batch 5138.53 | loss 0.78 | bpc 1.13211 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 404 at step 101000 | time: 1352.10s | valid loss 0.75 | bpc 1.07575 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 148 step 101010 | 168 batches | lr 0.000425 | ms/batch 11952.52 | loss 0.75 | bpc 1.08429 | |
| epoch 148 step 101020 | 178 batches | lr 0.000425 | ms/batch 5133.86 | loss 0.67 | bpc 0.96932 | |
| epoch 148 step 101030 | 188 batches | lr 0.000425 | ms/batch 5137.40 | loss 0.68 | bpc 0.97547 | |
| epoch 148 step 101040 | 198 batches | lr 0.000425 | ms/batch 5135.94 | loss 0.65 | bpc 0.93763 | |
| epoch 148 step 101050 | 208 batches | lr 0.000425 | ms/batch 5136.13 | loss 0.67 | bpc 0.96370 | |
| epoch 148 step 101060 | 218 batches | lr 0.000425 | ms/batch 5135.03 | loss 0.75 | bpc 1.08903 | |
| epoch 148 step 101070 | 228 batches | lr 0.000425 | ms/batch 5134.44 | loss 0.74 | bpc 1.06899 | |
| epoch 148 step 101080 | 238 batches | lr 0.000425 | ms/batch 5133.32 | loss 0.73 | bpc 1.05981 | |
| epoch 148 step 101090 | 248 batches | lr 0.000425 | ms/batch 5134.44 | loss 0.78 | bpc 1.11979 | |
| epoch 148 step 101100 | 258 batches | lr 0.000425 | ms/batch 5137.77 | loss 0.74 | bpc 1.07294 | |
| epoch 148 step 101110 | 268 batches | lr 0.000425 | ms/batch 5138.58 | loss 0.69 | bpc 1.00018 | |
| epoch 148 step 101120 | 278 batches | lr 0.000425 | ms/batch 5135.20 | loss 0.74 | bpc 1.07188 | |
| epoch 148 step 101130 | 288 batches | lr 0.000425 | ms/batch 5138.70 | loss 0.78 | bpc 1.12651 | |
| epoch 148 step 101140 | 298 batches | lr 0.000425 | ms/batch 5137.81 | loss 0.74 | bpc 1.07470 | |
| epoch 148 step 101150 | 308 batches | lr 0.000425 | ms/batch 5119.18 | loss 0.76 | bpc 1.08964 | |
| epoch 148 step 101160 | 318 batches | lr 0.000425 | ms/batch 5110.16 | loss 0.75 | bpc 1.08753 | |
| epoch 148 step 101170 | 328 batches | lr 0.000425 | ms/batch 5110.39 | loss 0.75 | bpc 1.08893 | |
| epoch 148 step 101180 | 338 batches | lr 0.000425 | ms/batch 5111.73 | loss 0.73 | bpc 1.04763 | |
| epoch 148 step 101190 | 348 batches | lr 0.000425 | ms/batch 5111.74 | loss 0.72 | bpc 1.03766 | |
| epoch 148 step 101200 | 358 batches | lr 0.000425 | ms/batch 5113.41 | loss 0.79 | bpc 1.13964 | |
| epoch 148 step 101210 | 368 batches | lr 0.000425 | ms/batch 5111.54 | loss 0.74 | bpc 1.06073 | |
| epoch 148 step 101220 | 378 batches | lr 0.000425 | ms/batch 5130.00 | loss 0.73 | bpc 1.05980 | |
| epoch 148 step 101230 | 388 batches | lr 0.000425 | ms/batch 5138.08 | loss 0.77 | bpc 1.10561 | |
| epoch 148 step 101240 | 398 batches | lr 0.000425 | ms/batch 5137.90 | loss 0.76 | bpc 1.09865 | |
| epoch 148 step 101250 | 408 batches | lr 0.000425 | ms/batch 5139.71 | loss 0.76 | bpc 1.10031 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 405 at step 101250 | time: 1350.61s | valid loss 0.74 | bpc 1.06764 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 148 step 101260 | 418 batches | lr 0.000425 | ms/batch 11949.10 | loss 0.73 | bpc 1.05699 | |
| epoch 148 step 101270 | 428 batches | lr 0.000425 | ms/batch 5132.45 | loss 0.72 | bpc 1.04350 | |
| epoch 148 step 101280 | 438 batches | lr 0.000425 | ms/batch 5136.79 | loss 0.77 | bpc 1.10559 | |
| epoch 148 step 101290 | 448 batches | lr 0.000425 | ms/batch 5135.89 | loss 0.76 | bpc 1.09627 | |
| epoch 148 step 101300 | 458 batches | lr 0.000425 | ms/batch 5136.89 | loss 0.76 | bpc 1.09413 | |
| epoch 148 step 101310 | 468 batches | lr 0.000425 | ms/batch 5139.34 | loss 0.72 | bpc 1.03914 | |
| epoch 148 step 101320 | 478 batches | lr 0.000425 | ms/batch 5137.59 | loss 0.75 | bpc 1.08801 | |
| epoch 148 step 101330 | 488 batches | lr 0.000425 | ms/batch 5143.30 | loss 0.75 | bpc 1.07909 | |
| epoch 148 step 101340 | 498 batches | lr 0.000425 | ms/batch 5139.47 | loss 0.74 | bpc 1.07060 | |
| epoch 148 step 101350 | 508 batches | lr 0.000425 | ms/batch 5139.49 | loss 0.75 | bpc 1.07993 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
| epoch 148 step 101360 | 518 batches | lr 0.000425 | ms/batch 5134.76 | loss 0.74 | bpc 1.06622 | |
| epoch 148 step 101370 | 528 batches | lr 0.000425 | ms/batch 5137.14 | loss 0.76 | bpc 1.10168 | |
| epoch 148 step 101380 | 538 batches | lr 0.000425 | ms/batch 5136.89 | loss 0.78 | bpc 1.13203 | |
| epoch 148 step 101390 | 548 batches | lr 0.000425 | ms/batch 5139.33 | loss 0.73 | bpc 1.05468 | |
| epoch 148 step 101400 | 558 batches | lr 0.000425 | ms/batch 5138.22 | loss 0.76 | bpc 1.09788 | |
| epoch 148 step 101410 | 568 batches | lr 0.000425 | ms/batch 5139.85 | loss 0.77 | bpc 1.11243 | |
| epoch 148 step 101420 | 578 batches | lr 0.000425 | ms/batch 5136.87 | loss 0.73 | bpc 1.05150 | |
| epoch 148 step 101430 | 588 batches | lr 0.000425 | ms/batch 5136.67 | loss 0.75 | bpc 1.08149 | |
| epoch 148 step 101440 | 598 batches | lr 0.000425 | ms/batch 5133.66 | loss 0.76 | bpc 1.09418 | |
| epoch 148 step 101450 | 608 batches | lr 0.000425 | ms/batch 5133.59 | loss 0.74 | bpc 1.06118 | |
| epoch 148 step 101460 | 618 batches | lr 0.000425 | ms/batch 5137.08 | loss 0.74 | bpc 1.07279 | |
| epoch 148 step 101470 | 628 batches | lr 0.000425 | ms/batch 5141.77 | loss 0.76 | bpc 1.10073 | |
| epoch 148 step 101480 | 638 batches | lr 0.000425 | ms/batch 5136.48 | loss 0.74 | bpc 1.06419 | |
| epoch 148 step 101490 | 648 batches | lr 0.000425 | ms/batch 5137.86 | loss 0.79 | bpc 1.13849 | |
| epoch 148 step 101500 | 658 batches | lr 0.000425 | ms/batch 5136.17 | loss 0.77 | bpc 1.11173 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 406 at step 101500 | time: 1352.71s | valid loss 0.74 | bpc 1.07176 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 148 step 101510 | 668 batches | lr 0.000425 | ms/batch 11949.99 | loss 0.77 | bpc 1.10574 | |
| epoch 148 step 101520 | 678 batches | lr 0.000425 | ms/batch 5138.57 | loss 0.69 | bpc 1.00024 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
| epoch 149 step 101530 | 2 batches | lr 0.000425 | ms/batch 5137.52 | loss 0.69 | bpc 0.99837 | |
| epoch 149 step 101540 | 12 batches | lr 0.000425 | ms/batch 5139.90 | loss 0.74 | bpc 1.06567 | |
| epoch 149 step 101550 | 22 batches | lr 0.000425 | ms/batch 5135.73 | loss 0.77 | bpc 1.10631 | |
| epoch 149 step 101560 | 32 batches | lr 0.000425 | ms/batch 5140.58 | loss 0.73 | bpc 1.05398 | |
| epoch 149 step 101570 | 42 batches | lr 0.000425 | ms/batch 5138.21 | loss 0.72 | bpc 1.03804 | |
| epoch 149 step 101580 | 52 batches | lr 0.000425 | ms/batch 5138.76 | loss 0.75 | bpc 1.08633 | |
| epoch 149 step 101590 | 62 batches | lr 0.000425 | ms/batch 5139.06 | loss 0.75 | bpc 1.08207 | |
| epoch 149 step 101600 | 72 batches | lr 0.000425 | ms/batch 5137.03 | loss 0.75 | bpc 1.08588 | |
| epoch 149 step 101610 | 82 batches | lr 0.000425 | ms/batch 5139.12 | loss 0.70 | bpc 1.01112 | |
| epoch 149 step 101620 | 92 batches | lr 0.000425 | ms/batch 5141.47 | loss 0.72 | bpc 1.03658 | |
| epoch 149 step 101630 | 102 batches | lr 0.000424 | ms/batch 5138.61 | loss 0.73 | bpc 1.05425 | |
| epoch 149 step 101640 | 112 batches | lr 0.000424 | ms/batch 5141.74 | loss 0.75 | bpc 1.07843 | |
| epoch 149 step 101650 | 122 batches | lr 0.000424 | ms/batch 5133.86 | loss 0.75 | bpc 1.08025 | |
| epoch 149 step 101660 | 132 batches | lr 0.000424 | ms/batch 5136.67 | loss 0.76 | bpc 1.09384 | |
| epoch 149 step 101670 | 142 batches | lr 0.000424 | ms/batch 5140.33 | loss 0.80 | bpc 1.15137 | |
| epoch 149 step 101680 | 152 batches | lr 0.000424 | ms/batch 5139.26 | loss 0.76 | bpc 1.09299 | |
| epoch 149 step 101690 | 162 batches | lr 0.000424 | ms/batch 5136.88 | loss 0.78 | bpc 1.12396 | |
| epoch 149 step 101700 | 172 batches | lr 0.000424 | ms/batch 5133.44 | loss 0.70 | bpc 1.00962 | |
| epoch 149 step 101710 | 182 batches | lr 0.000424 | ms/batch 5138.53 | loss 0.67 | bpc 0.97317 | |
| epoch 149 step 101720 | 192 batches | lr 0.000424 | ms/batch 5138.72 | loss 0.68 | bpc 0.98611 | |
| epoch 149 step 101730 | 202 batches | lr 0.000424 | ms/batch 5136.32 | loss 0.61 | bpc 0.87611 | |
| epoch 149 step 101740 | 212 batches | lr 0.000424 | ms/batch 5137.44 | loss 0.72 | bpc 1.04564 | |
| epoch 149 step 101750 | 222 batches | lr 0.000424 | ms/batch 5141.88 | loss 0.76 | bpc 1.08981 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 407 at step 101750 | time: 1352.75s | valid loss 0.74 | bpc 1.07164 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 149 step 101760 | 232 batches | lr 0.000424 | ms/batch 11952.14 | loss 0.74 | bpc 1.06380 | |
| epoch 149 step 101770 | 242 batches | lr 0.000424 | ms/batch 5137.26 | loss 0.74 | bpc 1.06647 | |
| epoch 149 step 101780 | 252 batches | lr 0.000424 | ms/batch 5135.48 | loss 0.79 | bpc 1.13786 | |
| epoch 149 step 101790 | 262 batches | lr 0.000424 | ms/batch 5137.07 | loss 0.71 | bpc 1.02473 | |
| epoch 149 step 101800 | 272 batches | lr 0.000424 | ms/batch 5134.57 | loss 0.70 | bpc 1.01265 | |
| epoch 149 step 101810 | 282 batches | lr 0.000424 | ms/batch 5141.22 | loss 0.76 | bpc 1.09783 | |
| epoch 149 step 101820 | 292 batches | lr 0.000424 | ms/batch 5135.66 | loss 0.76 | bpc 1.09710 | |
| epoch 149 step 101830 | 302 batches | lr 0.000424 | ms/batch 5137.48 | loss 0.75 | bpc 1.08652 | |
| epoch 149 step 101840 | 312 batches | lr 0.000424 | ms/batch 5137.33 | loss 0.77 | bpc 1.11660 | |
| epoch 149 step 101850 | 322 batches | lr 0.000424 | ms/batch 5137.55 | loss 0.73 | bpc 1.05737 | |
| epoch 149 step 101860 | 332 batches | lr 0.000424 | ms/batch 5136.57 | loss 0.77 | bpc 1.10586 | |
| epoch 149 step 101870 | 342 batches | lr 0.000424 | ms/batch 5140.36 | loss 0.70 | bpc 1.00519 | |
| epoch 149 step 101880 | 352 batches | lr 0.000424 | ms/batch 5137.52 | loss 0.76 | bpc 1.10141 | |
| epoch 149 step 101890 | 362 batches | lr 0.000424 | ms/batch 5138.64 | loss 0.76 | bpc 1.10317 | |
| epoch 149 step 101900 | 372 batches | lr 0.000424 | ms/batch 5139.07 | loss 0.72 | bpc 1.03524 | |
| epoch 149 step 101910 | 382 batches | lr 0.000424 | ms/batch 5139.80 | loss 0.76 | bpc 1.09094 | |
| epoch 149 step 101920 | 392 batches | lr 0.000424 | ms/batch 5134.38 | loss 0.75 | bpc 1.08066 | |
| epoch 149 step 101930 | 402 batches | lr 0.000424 | ms/batch 5138.10 | loss 0.79 | bpc 1.13402 | |
| epoch 149 step 101940 | 412 batches | lr 0.000424 | ms/batch 5135.59 | loss 0.74 | bpc 1.07405 | |
| epoch 149 step 101950 | 422 batches | lr 0.000424 | ms/batch 5138.66 | loss 0.73 | bpc 1.05106 | |
| epoch 149 step 101960 | 432 batches | lr 0.000424 | ms/batch 5140.52 | loss 0.74 | bpc 1.06560 | |
| epoch 149 step 101970 | 442 batches | lr 0.000424 | ms/batch 5140.20 | loss 0.77 | bpc 1.11418 | |
| epoch 149 step 101980 | 452 batches | lr 0.000424 | ms/batch 5134.05 | loss 0.76 | bpc 1.09063 | |
| epoch 149 step 101990 | 462 batches | lr 0.000424 | ms/batch 5137.58 | loss 0.75 | bpc 1.08487 | |
| epoch 149 step 102000 | 472 batches | lr 0.000424 | ms/batch 5139.30 | loss 0.74 | bpc 1.06360 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 408 at step 102000 | time: 1352.56s | valid loss 0.74 | bpc 1.07360 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 149 step 102010 | 482 batches | lr 0.000424 | ms/batch 11916.66 | loss 0.75 | bpc 1.08851 | |
| epoch 149 step 102020 | 492 batches | lr 0.000424 | ms/batch 5112.55 | loss 0.74 | bpc 1.06774 | |
| epoch 149 step 102030 | 502 batches | lr 0.000424 | ms/batch 5113.05 | loss 0.74 | bpc 1.06282 | |
| epoch 149 step 102040 | 512 batches | lr 0.000424 | ms/batch 5111.73 | loss 0.74 | bpc 1.06766 | |
| epoch 149 step 102050 | 522 batches | lr 0.000424 | ms/batch 5112.78 | loss 0.75 | bpc 1.08245 | |
| epoch 149 step 102060 | 532 batches | lr 0.000424 | ms/batch 5111.02 | loss 0.79 | bpc 1.13798 | |
| epoch 149 step 102070 | 542 batches | lr 0.000424 | ms/batch 5134.60 | loss 0.75 | bpc 1.07704 | |
| epoch 149 step 102080 | 552 batches | lr 0.000424 | ms/batch 5139.50 | loss 0.75 | bpc 1.08570 | |
| epoch 149 step 102090 | 562 batches | lr 0.000424 | ms/batch 5138.34 | loss 0.77 | bpc 1.11736 | |
| epoch 149 step 102100 | 572 batches | lr 0.000424 | ms/batch 5141.98 | loss 0.74 | bpc 1.07341 | |
| epoch 149 step 102110 | 582 batches | lr 0.000424 | ms/batch 5138.24 | loss 0.73 | bpc 1.04859 | |
| epoch 149 step 102120 | 592 batches | lr 0.000424 | ms/batch 5141.29 | loss 0.77 | bpc 1.11223 | |
| epoch 149 step 102130 | 602 batches | lr 0.000424 | ms/batch 5142.65 | loss 0.74 | bpc 1.06518 | |
| epoch 149 step 102140 | 612 batches | lr 0.000424 | ms/batch 5139.33 | loss 0.72 | bpc 1.03899 | |
| epoch 149 step 102150 | 622 batches | lr 0.000424 | ms/batch 5136.50 | loss 0.76 | bpc 1.10261 | |
| epoch 149 step 102160 | 632 batches | lr 0.000424 | ms/batch 5137.92 | loss 0.76 | bpc 1.09612 | |
| epoch 149 step 102170 | 642 batches | lr 0.000424 | ms/batch 5139.81 | loss 0.75 | bpc 1.08526 | |
| epoch 149 step 102180 | 652 batches | lr 0.000424 | ms/batch 5139.07 | loss 0.78 | bpc 1.13062 | |
| epoch 149 step 102190 | 662 batches | lr 0.000424 | ms/batch 5131.62 | loss 0.78 | bpc 1.12133 | |
| epoch 149 step 102200 | 672 batches | lr 0.000424 | ms/batch 5133.67 | loss 0.72 | bpc 1.03573 | |
| epoch 149 step 102210 | 682 batches | lr 0.000424 | ms/batch 5139.46 | loss 0.71 | bpc 1.01831 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 150 step 102220 | 6 batches | lr 0.000424 | ms/batch 5141.88 | loss 0.71 | bpc 1.01972 | |
| epoch 150 step 102230 | 16 batches | lr 0.000424 | ms/batch 5139.52 | loss 0.74 | bpc 1.06812 | |
| epoch 150 step 102240 | 26 batches | lr 0.000424 | ms/batch 5140.45 | loss 0.78 | bpc 1.11827 | |
| epoch 150 step 102250 | 36 batches | lr 0.000424 | ms/batch 5139.38 | loss 0.70 | bpc 1.01665 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 409 at step 102250 | time: 1352.06s | valid loss 0.75 | bpc 1.07572 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 150 step 102260 | 46 batches | lr 0.000424 | ms/batch 11952.69 | loss 0.73 | bpc 1.05823 | |
| epoch 150 step 102270 | 56 batches | lr 0.000424 | ms/batch 5139.24 | loss 0.77 | bpc 1.11028 | |
| epoch 150 step 102280 | 66 batches | lr 0.000424 | ms/batch 5141.48 | loss 0.76 | bpc 1.09239 | |
| epoch 150 step 102290 | 76 batches | lr 0.000424 | ms/batch 5139.39 | loss 0.71 | bpc 1.03041 | |
| epoch 150 step 102300 | 86 batches | lr 0.000424 | ms/batch 5133.58 | loss 0.71 | bpc 1.02768 | |
| epoch 150 step 102310 | 96 batches | lr 0.000424 | ms/batch 5135.01 | loss 0.71 | bpc 1.02901 | |
| epoch 150 step 102320 | 106 batches | lr 0.000424 | ms/batch 5138.24 | loss 0.75 | bpc 1.08286 | |
| epoch 150 step 102330 | 116 batches | lr 0.000424 | ms/batch 5142.09 | loss 0.74 | bpc 1.07194 | |
| epoch 150 step 102340 | 126 batches | lr 0.000423 | ms/batch 5139.32 | loss 0.75 | bpc 1.08635 | |
| epoch 150 step 102350 | 136 batches | lr 0.000423 | ms/batch 5138.91 | loss 0.77 | bpc 1.11044 | |
| epoch 150 step 102360 | 146 batches | lr 0.000423 | ms/batch 5134.72 | loss 0.77 | bpc 1.10607 | |
| epoch 150 step 102370 | 156 batches | lr 0.000423 | ms/batch 5139.40 | loss 0.77 | bpc 1.10760 | |
| epoch 150 step 102380 | 166 batches | lr 0.000423 | ms/batch 5138.01 | loss 0.77 | bpc 1.11706 | |
| epoch 150 step 102390 | 176 batches | lr 0.000423 | ms/batch 5138.40 | loss 0.68 | bpc 0.97499 | |
| epoch 150 step 102400 | 186 batches | lr 0.000423 | ms/batch 5137.77 | loss 0.68 | bpc 0.97980 | |
| epoch 150 step 102410 | 196 batches | lr 0.000423 | ms/batch 5141.53 | loss 0.66 | bpc 0.94869 | |
| epoch 150 step 102420 | 206 batches | lr 0.000423 | ms/batch 5142.00 | loss 0.65 | bpc 0.93256 | |
| epoch 150 step 102430 | 216 batches | lr 0.000423 | ms/batch 5135.85 | loss 0.75 | bpc 1.07755 | |
| epoch 150 step 102440 | 226 batches | lr 0.000423 | ms/batch 5135.69 | loss 0.74 | bpc 1.06292 | |
| epoch 150 step 102450 | 236 batches | lr 0.000423 | ms/batch 5137.18 | loss 0.73 | bpc 1.05491 | |
| epoch 150 step 102460 | 246 batches | lr 0.000423 | ms/batch 5141.74 | loss 0.78 | bpc 1.12033 | |
| epoch 150 step 102470 | 256 batches | lr 0.000423 | ms/batch 5134.34 | loss 0.74 | bpc 1.07473 | |
| epoch 150 step 102480 | 266 batches | lr 0.000423 | ms/batch 5127.36 | loss 0.70 | bpc 1.00529 | |
| epoch 150 step 102490 | 276 batches | lr 0.000423 | ms/batch 5139.47 | loss 0.74 | bpc 1.06253 | |
| epoch 150 step 102500 | 286 batches | lr 0.000423 | ms/batch 5142.70 | loss 0.78 | bpc 1.12674 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 410 at step 102500 | time: 1352.73s | valid loss 0.74 | bpc 1.06951 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 150 step 102510 | 296 batches | lr 0.000423 | ms/batch 11951.56 | loss 0.74 | bpc 1.06885 | |
| epoch 150 step 102520 | 306 batches | lr 0.000423 | ms/batch 5137.37 | loss 0.75 | bpc 1.07694 | |
| epoch 150 step 102530 | 316 batches | lr 0.000423 | ms/batch 5139.50 | loss 0.77 | bpc 1.11137 | |
| epoch 150 step 102540 | 326 batches | lr 0.000423 | ms/batch 5138.39 | loss 0.75 | bpc 1.07547 | |
| epoch 150 step 102550 | 336 batches | lr 0.000423 | ms/batch 5135.67 | loss 0.74 | bpc 1.07426 | |
| epoch 150 step 102560 | 346 batches | lr 0.000423 | ms/batch 5132.89 | loss 0.71 | bpc 1.03140 | |
| epoch 150 step 102570 | 356 batches | lr 0.000423 | ms/batch 5137.90 | loss 0.78 | bpc 1.12098 | |
| epoch 150 step 102580 | 366 batches | lr 0.000423 | ms/batch 5138.51 | loss 0.74 | bpc 1.06078 | |
| epoch 150 step 102590 | 376 batches | lr 0.000423 | ms/batch 5136.11 | loss 0.73 | bpc 1.04947 | |
| epoch 150 step 102600 | 386 batches | lr 0.000423 | ms/batch 5135.69 | loss 0.77 | bpc 1.10426 | |
| epoch 150 step 102610 | 396 batches | lr 0.000423 | ms/batch 5138.97 | loss 0.76 | bpc 1.08984 | |
| epoch 150 step 102620 | 406 batches | lr 0.000423 | ms/batch 5139.83 | loss 0.76 | bpc 1.10274 | |
| epoch 150 step 102630 | 416 batches | lr 0.000423 | ms/batch 5136.99 | loss 0.75 | bpc 1.07485 | |
| epoch 150 step 102640 | 426 batches | lr 0.000423 | ms/batch 5138.18 | loss 0.71 | bpc 1.02683 | |
| epoch 150 step 102650 | 436 batches | lr 0.000423 | ms/batch 5135.23 | loss 0.78 | bpc 1.12030 | |
| epoch 150 step 102660 | 446 batches | lr 0.000423 | ms/batch 5137.53 | loss 0.75 | bpc 1.08659 | |
| epoch 150 step 102670 | 456 batches | lr 0.000423 | ms/batch 5139.06 | loss 0.75 | bpc 1.08534 | |
| epoch 150 step 102680 | 466 batches | lr 0.000423 | ms/batch 5137.39 | loss 0.74 | bpc 1.07061 | |
| epoch 150 step 102690 | 476 batches | lr 0.000423 | ms/batch 5138.42 | loss 0.74 | bpc 1.06269 | |
| epoch 150 step 102700 | 486 batches | lr 0.000423 | ms/batch 5138.35 | loss 0.75 | bpc 1.08875 | |
| epoch 150 step 102710 | 496 batches | lr 0.000423 | ms/batch 5141.39 | loss 0.75 | bpc 1.07757 | |
| epoch 150 step 102720 | 506 batches | lr 0.000423 | ms/batch 5137.85 | loss 0.73 | bpc 1.05529 | |
| epoch 150 step 102730 | 516 batches | lr 0.000423 | ms/batch 5139.20 | loss 0.75 | bpc 1.07809 | |
| epoch 150 step 102740 | 526 batches | lr 0.000423 | ms/batch 5137.88 | loss 0.76 | bpc 1.09121 | |
| epoch 150 step 102750 | 536 batches | lr 0.000423 | ms/batch 5137.38 | loss 0.79 | bpc 1.13253 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 411 at step 102750 | time: 1353.82s | valid loss 0.74 | bpc 1.07328 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 150 step 102760 | 546 batches | lr 0.000423 | ms/batch 11950.87 | loss 0.74 | bpc 1.06246 | |
| epoch 150 step 102770 | 556 batches | lr 0.000423 | ms/batch 5137.11 | loss 0.76 | bpc 1.10258 | |
| epoch 150 step 102780 | 566 batches | lr 0.000423 | ms/batch 5136.90 | loss 0.77 | bpc 1.11488 | |
| epoch 150 step 102790 | 576 batches | lr 0.000423 | ms/batch 5139.94 | loss 0.73 | bpc 1.05619 | |
| epoch 150 step 102800 | 586 batches | lr 0.000423 | ms/batch 5139.04 | loss 0.74 | bpc 1.06405 | |
| epoch 150 step 102810 | 596 batches | lr 0.000423 | ms/batch 5135.58 | loss 0.77 | bpc 1.10450 | |
| epoch 150 step 102820 | 606 batches | lr 0.000423 | ms/batch 5140.13 | loss 0.73 | bpc 1.05878 | |
| epoch 150 step 102830 | 616 batches | lr 0.000423 | ms/batch 5138.46 | loss 0.73 | bpc 1.05758 | |
| epoch 150 step 102840 | 626 batches | lr 0.000423 | ms/batch 5137.87 | loss 0.75 | bpc 1.08618 | |
| epoch 150 step 102850 | 636 batches | lr 0.000423 | ms/batch 5140.59 | loss 0.76 | bpc 1.10063 | |
| epoch 150 step 102860 | 646 batches | lr 0.000423 | ms/batch 5137.29 | loss 0.76 | bpc 1.09041 | |
| epoch 150 step 102870 | 656 batches | lr 0.000423 | ms/batch 5133.88 | loss 0.78 | bpc 1.13098 | |
| epoch 150 step 102880 | 666 batches | lr 0.000423 | ms/batch 5137.35 | loss 0.77 | bpc 1.11765 | |
| epoch 150 step 102890 | 676 batches | lr 0.000423 | ms/batch 5140.09 | loss 0.70 | bpc 1.00554 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
| epoch 150 step 102900 | 686 batches | lr 0.000423 | ms/batch 5134.95 | loss 0.69 | bpc 0.99752 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 151 step 102910 | 10 batches | lr 0.000423 | ms/batch 5139.75 | loss 0.73 | bpc 1.04618 | |
| epoch 151 step 102920 | 20 batches | lr 0.000423 | ms/batch 5119.34 | loss 0.76 | bpc 1.09059 | |
| epoch 151 step 102930 | 30 batches | lr 0.000423 | ms/batch 5110.14 | loss 0.75 | bpc 1.08106 | |
| epoch 151 step 102940 | 40 batches | lr 0.000423 | ms/batch 5110.55 | loss 0.71 | bpc 1.02695 | |
| epoch 151 step 102950 | 50 batches | lr 0.000423 | ms/batch 5111.75 | loss 0.75 | bpc 1.08694 | |
| epoch 151 step 102960 | 60 batches | lr 0.000423 | ms/batch 5109.43 | loss 0.76 | bpc 1.10329 | |
| epoch 151 step 102970 | 70 batches | lr 0.000423 | ms/batch 5111.61 | loss 0.74 | bpc 1.06657 | |
| epoch 151 step 102980 | 80 batches | lr 0.000423 | ms/batch 5111.21 | loss 0.70 | bpc 1.01150 | |
| epoch 151 step 102990 | 90 batches | lr 0.000423 | ms/batch 5108.74 | loss 0.73 | bpc 1.05349 | |
| epoch 151 step 103000 | 100 batches | lr 0.000423 | ms/batch 5118.25 | loss 0.70 | bpc 1.01611 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 412 at step 103000 | time: 1351.51s | valid loss 0.75 | bpc 1.07488 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 151 step 103010 | 110 batches | lr 0.000423 | ms/batch 11952.96 | loss 0.76 | bpc 1.09961 | |
| epoch 151 step 103020 | 120 batches | lr 0.000423 | ms/batch 5139.36 | loss 0.73 | bpc 1.05529 | |
| epoch 151 step 103030 | 130 batches | lr 0.000423 | ms/batch 5134.96 | loss 0.77 | bpc 1.11581 | |
| epoch 151 step 103040 | 140 batches | lr 0.000423 | ms/batch 5140.07 | loss 0.79 | bpc 1.13598 | |
| epoch 151 step 103050 | 150 batches | lr 0.000422 | ms/batch 5137.01 | loss 0.75 | bpc 1.07724 | |
| epoch 151 step 103060 | 160 batches | lr 0.000422 | ms/batch 5137.02 | loss 0.78 | bpc 1.12880 | |
| epoch 151 step 103070 | 170 batches | lr 0.000422 | ms/batch 5140.95 | loss 0.73 | bpc 1.04954 | |
| epoch 151 step 103080 | 180 batches | lr 0.000422 | ms/batch 5138.01 | loss 0.67 | bpc 0.96959 | |
| epoch 151 step 103090 | 190 batches | lr 0.000422 | ms/batch 5139.04 | loss 0.67 | bpc 0.96959 | |
| epoch 151 step 103100 | 200 batches | lr 0.000422 | ms/batch 5139.38 | loss 0.63 | bpc 0.90700 | |
| epoch 151 step 103110 | 210 batches | lr 0.000422 | ms/batch 5131.17 | loss 0.70 | bpc 1.00334 | |
| epoch 151 step 103120 | 220 batches | lr 0.000422 | ms/batch 5135.24 | loss 0.76 | bpc 1.09996 | |
| epoch 151 step 103130 | 230 batches | lr 0.000422 | ms/batch 5140.48 | loss 0.74 | bpc 1.06099 | |
| epoch 151 step 103140 | 240 batches | lr 0.000422 | ms/batch 5139.20 | loss 0.74 | bpc 1.06185 | |
| epoch 151 step 103150 | 250 batches | lr 0.000422 | ms/batch 5134.84 | loss 0.78 | bpc 1.12509 | |
| epoch 151 step 103160 | 260 batches | lr 0.000422 | ms/batch 5132.40 | loss 0.72 | bpc 1.03717 | |
| epoch 151 step 103170 | 270 batches | lr 0.000422 | ms/batch 5132.43 | loss 0.70 | bpc 1.01367 | |
| epoch 151 step 103180 | 280 batches | lr 0.000422 | ms/batch 5136.62 | loss 0.75 | bpc 1.07602 | |
| epoch 151 step 103190 | 290 batches | lr 0.000422 | ms/batch 5132.69 | loss 0.77 | bpc 1.11605 | |
| epoch 151 step 103200 | 300 batches | lr 0.000422 | ms/batch 5137.43 | loss 0.75 | bpc 1.07852 | |
| epoch 151 step 103210 | 310 batches | lr 0.000422 | ms/batch 5137.77 | loss 0.77 | bpc 1.10769 | |
| epoch 151 step 103220 | 320 batches | lr 0.000422 | ms/batch 5135.66 | loss 0.73 | bpc 1.05498 | |
| epoch 151 step 103230 | 330 batches | lr 0.000422 | ms/batch 5137.72 | loss 0.76 | bpc 1.09529 | |
| epoch 151 step 103240 | 340 batches | lr 0.000422 | ms/batch 5138.81 | loss 0.72 | bpc 1.03785 | |
| epoch 151 step 103250 | 350 batches | lr 0.000422 | ms/batch 5138.30 | loss 0.74 | bpc 1.06289 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 413 at step 103250 | time: 1353.52s | valid loss 0.74 | bpc 1.07377 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 151 step 103260 | 360 batches | lr 0.000422 | ms/batch 11941.76 | loss 0.78 | bpc 1.13150 | |
| epoch 151 step 103270 | 370 batches | lr 0.000422 | ms/batch 5130.11 | loss 0.73 | bpc 1.04790 | |
| epoch 151 step 103280 | 380 batches | lr 0.000422 | ms/batch 5125.95 | loss 0.74 | bpc 1.07403 | |
| epoch 151 step 103290 | 390 batches | lr 0.000422 | ms/batch 5127.94 | loss 0.74 | bpc 1.07268 | |
| epoch 151 step 103300 | 400 batches | lr 0.000422 | ms/batch 5126.95 | loss 0.78 | bpc 1.13113 | |
| epoch 151 step 103310 | 410 batches | lr 0.000422 | ms/batch 5131.95 | loss 0.75 | bpc 1.08892 | |
| epoch 151 step 103320 | 420 batches | lr 0.000422 | ms/batch 5137.38 | loss 0.73 | bpc 1.04918 | |
| epoch 151 step 103330 | 430 batches | lr 0.000422 | ms/batch 5133.11 | loss 0.73 | bpc 1.05632 | |
| epoch 151 step 103340 | 440 batches | lr 0.000422 | ms/batch 5126.41 | loss 0.77 | bpc 1.10454 | |
| epoch 151 step 103350 | 450 batches | lr 0.000422 | ms/batch 5130.22 | loss 0.76 | bpc 1.09184 | |
| epoch 151 step 103360 | 460 batches | lr 0.000422 | ms/batch 5125.86 | loss 0.75 | bpc 1.08643 | |
| epoch 151 step 103370 | 470 batches | lr 0.000422 | ms/batch 5128.06 | loss 0.73 | bpc 1.04867 | |
| epoch 151 step 103380 | 480 batches | lr 0.000422 | ms/batch 5129.65 | loss 0.76 | bpc 1.09629 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0 | |
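The "Gradient overflow" lines above come from the dynamic loss scaler used for mixed-precision training (the message format matches NVIDIA Apex amp): when inf/nan gradients are detected, the optimizer step is skipped and the scale is cut, and it is grown back after a run of clean steps; the four near-identical lines per event are presumably one report per worker. A minimal sketch of the mechanism, not Apex's actual implementation:

    class DynamicLossScaler:
        # Sketch of Apex-amp-style dynamic loss scaling; names and the
        # growth_interval default are assumptions, not Apex's real code.
        def __init__(self, init_scale=2.0**20, growth_interval=2000):
            self.scale = init_scale
            self.growth_interval = growth_interval
            self.clean_steps = 0

        def update(self, found_inf_or_nan):
            if found_inf_or_nan:
                self.scale /= 2.0      # -> "reducing loss scale to ..." lines
                self.clean_steps = 0
                return False           # caller skips optimizer.step()
            self.clean_steps += 1
            if self.clean_steps >= self.growth_interval:
                self.scale *= 2.0      # recover the scale after a clean run
                self.clean_steps = 0
            return True                # safe to apply the step

From the first "loss nan | bpc nan" row further down, every step overflows, so the scale only ever shrinks; that is why it decays geometrically from 524288.0 toward zero for the rest of this log.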
| epoch 151 step 103390 | 490 batches | lr 0.000422 | ms/batch 5126.37 | loss 0.74 | bpc 1.06864 | |
| epoch 151 step 103400 | 500 batches | lr 0.000422 | ms/batch 5121.61 | loss 0.74 | bpc 1.06400 | |
| epoch 151 step 103410 | 510 batches | lr 0.000422 | ms/batch 5130.20 | loss 0.74 | bpc 1.06747 | |
| epoch 151 step 103420 | 520 batches | lr 0.000422 | ms/batch 5129.02 | loss 0.75 | bpc 1.08288 | |
| epoch 151 step 103430 | 530 batches | lr 0.000422 | ms/batch 5128.04 | loss 0.77 | bpc 1.10727 | |
| epoch 151 step 103440 | 540 batches | lr 0.000422 | ms/batch 5130.86 | loss 0.78 | bpc 1.11936 | |
| epoch 151 step 103450 | 550 batches | lr 0.000422 | ms/batch 5125.68 | loss 0.73 | bpc 1.05748 | |
| epoch 151 step 103460 | 560 batches | lr 0.000422 | ms/batch 5126.77 | loss 0.77 | bpc 1.10379 | |
| epoch 151 step 103470 | 570 batches | lr 0.000422 | ms/batch 5132.81 | loss 0.76 | bpc 1.09157 | |
| epoch 151 step 103480 | 580 batches | lr 0.000422 | ms/batch 5126.57 | loss 0.74 | bpc 1.06090 | |
| epoch 151 step 103490 | 590 batches | lr 0.000422 | ms/batch 5129.01 | loss 0.75 | bpc 1.08078 | |
| epoch 151 step 103500 | 600 batches | lr 0.000422 | ms/batch 5129.57 | loss 0.75 | bpc 1.08055 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 414 at step 103500 | time: 1350.36s | valid loss 0.74 | bpc 1.06895 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 151 step 103510 | 610 batches | lr 0.000422 | ms/batch 11939.43 | loss 0.73 | bpc 1.05059 | |
| epoch 151 step 103520 | 620 batches | lr 0.000422 | ms/batch 5129.54 | loss 0.75 | bpc 1.08378 | |
| epoch 151 step 103530 | 630 batches | lr 0.000422 | ms/batch 5132.59 | loss 0.76 | bpc 1.10107 | |
| epoch 151 step 103540 | 640 batches | lr 0.000422 | ms/batch 5129.20 | loss 0.75 | bpc 1.08040 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0 | |
| epoch 151 step 103550 | 650 batches | lr 0.000422 | ms/batch 5128.56 | loss 0.79 | bpc 1.13640 | |
| epoch 151 step 103560 | 660 batches | lr 0.000422 | ms/batch 5130.80 | loss 0.77 | bpc 1.10964 | |
| epoch 151 step 103570 | 670 batches | lr 0.000422 | ms/batch 5131.29 | loss 0.75 | bpc 1.07749 | |
| epoch 151 step 103580 | 680 batches | lr 0.000422 | ms/batch 5132.10 | loss 0.70 | bpc 1.00626 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
| epoch 152 step 103590 | 4 batches | lr 0.000422 | ms/batch 5134.33 | loss 0.70 | bpc 1.00281 | |
| epoch 152 step 103600 | 14 batches | lr 0.000422 | ms/batch 5128.49 | loss 0.74 | bpc 1.06372 | |
| epoch 152 step 103610 | 24 batches | lr 0.000422 | ms/batch 5131.75 | loss 0.77 | bpc 1.10664 | |
| epoch 152 step 103620 | 34 batches | lr 0.000422 | ms/batch 5128.04 | loss 0.73 | bpc 1.04626 | |
| epoch 152 step 103630 | 44 batches | lr 0.000422 | ms/batch 5130.41 | loss 0.72 | bpc 1.03885 | |
| epoch 152 step 103640 | 54 batches | lr 0.000422 | ms/batch 5129.95 | loss 0.77 | bpc 1.10548 | |
| epoch 152 step 103650 | 64 batches | lr 0.000422 | ms/batch 5131.20 | loss 0.75 | bpc 1.07841 | |
| epoch 152 step 103660 | 74 batches | lr 0.000422 | ms/batch 5131.42 | loss 0.74 | bpc 1.06178 | |
| epoch 152 step 103670 | 84 batches | lr 0.000422 | ms/batch 5131.41 | loss 0.70 | bpc 1.00912 | |
| epoch 152 step 103680 | 94 batches | lr 0.000422 | ms/batch 5131.65 | loss 0.72 | bpc 1.03452 | |
| epoch 152 step 103690 | 104 batches | lr 0.000422 | ms/batch 5130.02 | loss 0.74 | bpc 1.06794 | |
| epoch 152 step 103700 | 114 batches | lr 0.000422 | ms/batch 5135.04 | loss 0.74 | bpc 1.07193 | |
| epoch 152 step 103710 | 124 batches | lr 0.000422 | ms/batch 5132.36 | loss 0.76 | bpc 1.09026 | |
| epoch 152 step 103720 | 134 batches | lr 0.000422 | ms/batch 5128.96 | loss 0.76 | bpc 1.09308 | |
| epoch 152 step 103730 | 144 batches | lr 0.000422 | ms/batch 5133.66 | loss 0.78 | bpc 1.12686 | |
| epoch 152 step 103740 | 154 batches | lr 0.000422 | ms/batch 5132.68 | loss 0.76 | bpc 1.09769 | |
| epoch 152 step 103750 | 164 batches | lr 0.000421 | ms/batch 5130.79 | loss 0.78 | bpc 1.13182 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 415 at step 103750 | time: 1352.20s | valid loss 0.74 | bpc 1.06743 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 152 step 103760 | 174 batches | lr 0.000421 | ms/batch 11942.94 | loss 0.68 | bpc 0.97746 | |
| epoch 152 step 103770 | 184 batches | lr 0.000421 | ms/batch 5130.02 | loss 0.68 | bpc 0.97962 | |
| epoch 152 step 103780 | 194 batches | lr 0.000421 | ms/batch 5134.07 | loss 0.69 | bpc 0.98881 | |
| epoch 152 step 103790 | 204 batches | lr 0.000421 | ms/batch 5130.92 | loss 0.62 | bpc 0.88906 | |
| epoch 152 step 103800 | 214 batches | lr 0.000421 | ms/batch 5129.07 | loss 0.73 | bpc 1.04803 | |
| epoch 152 step 103810 | 224 batches | lr 0.000421 | ms/batch 5130.92 | loss 0.74 | bpc 1.06807 | |
| epoch 152 step 103820 | 234 batches | lr 0.000421 | ms/batch 5131.33 | loss 0.74 | bpc 1.06042 | |
| epoch 152 step 103830 | 244 batches | lr 0.000421 | ms/batch 5133.58 | loss 0.75 | bpc 1.08574 | |
| epoch 152 step 103840 | 254 batches | lr 0.000421 | ms/batch 5132.21 | loss 0.78 | bpc 1.11812 | |
| epoch 152 step 103850 | 264 batches | lr 0.000421 | ms/batch 5132.23 | loss 0.70 | bpc 1.00757 | |
| epoch 152 step 103860 | 274 batches | lr 0.000421 | ms/batch 5131.37 | loss 0.72 | bpc 1.03820 | |
| epoch 152 step 103870 | 284 batches | lr 0.000421 | ms/batch 5130.25 | loss 0.77 | bpc 1.10533 | |
| epoch 152 step 103880 | 294 batches | lr 0.000421 | ms/batch 5131.31 | loss 0.76 | bpc 1.09128 | |
| epoch 152 step 103890 | 304 batches | lr 0.000421 | ms/batch 5132.62 | loss 0.75 | bpc 1.08889 | |
| epoch 152 step 103900 | 314 batches | lr 0.000421 | ms/batch 5130.52 | loss 0.76 | bpc 1.10196 | |
| epoch 152 step 103910 | 324 batches | lr 0.000421 | ms/batch 5125.55 | loss 0.74 | bpc 1.07314 | |
| epoch 152 step 103920 | 334 batches | lr 0.000421 | ms/batch 5113.13 | loss 0.75 | bpc 1.08465 | |
| epoch 152 step 103930 | 344 batches | lr 0.000421 | ms/batch 5111.69 | loss 0.70 | bpc 1.01701 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0 | |
| epoch 152 step 103940 | 354 batches | lr 0.000421 | ms/batch 5110.87 | loss 0.77 | bpc 1.11808 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 103950 | 364 batches | lr 0.000421 | ms/batch 5110.69 | loss nan | bpc nan | |
| epoch 152 step 103960 | 374 batches | lr 0.000421 | ms/batch 5109.98 | loss 0.72 | bpc 1.03865 | |
| epoch 152 step 103970 | 384 batches | lr 0.000421 | ms/batch 5112.27 | loss 0.76 | bpc 1.09533 | |
| epoch 152 step 103980 | 394 batches | lr 0.000421 | ms/batch 5111.61 | loss 0.75 | bpc 1.08365 | |
| epoch 152 step 103990 | 404 batches | lr 0.000421 | ms/batch 5110.18 | loss 0.77 | bpc 1.10594 | |
| epoch 152 step 104000 | 414 batches | lr 0.000421 | ms/batch 5121.36 | loss 0.75 | bpc 1.08615 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
| Eval 416 at step 104000 | time: 1349.28s | valid loss 0.74 | bpc 1.06833 | |
---------------------------------------------------------------------------------------------------- | |
| epoch 152 step 104010 | 424 batches | lr 0.000421 | ms/batch 11948.50 | loss 0.72 | bpc 1.03598 | |
| epoch 152 step 104020 | 434 batches | lr 0.000421 | ms/batch 5138.66 | loss 0.76 | bpc 1.09152 | |
| epoch 152 step 104030 | 444 batches | lr 0.000421 | ms/batch 5134.95 | loss 0.76 | bpc 1.09304 | |
| epoch 152 step 104040 | 454 batches | lr 0.000421 | ms/batch 5133.48 | loss 0.77 | bpc 1.10379 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104050 | 464 batches | lr 0.000421 | ms/batch 5126.16 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104060 | 474 batches | lr 0.000421 | ms/batch 5123.08 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104070 | 484 batches | lr 0.000421 | ms/batch 5121.17 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07 | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07 | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07 | |
| epoch 152 step 104080 | 494 batches | lr 0.000421 | ms/batch 5116.92 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10 | |
| epoch 152 step 104090 | 504 batches | lr 0.000421 | ms/batch 5115.58 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104100 | 514 batches | lr 0.000421 | ms/batch 5120.08 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104110 | 524 batches | lr 0.000421 | ms/batch 5118.74 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104120 | 534 batches | lr 0.000421 | ms/batch 5117.64 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104130 | 544 batches | lr 0.000421 | ms/batch 5124.81 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104140 | 554 batches | lr 0.000421 | ms/batch 5121.15 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104150 | 564 batches | lr 0.000421 | ms/batch 5124.05 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104160 | 574 batches | lr 0.000421 | ms/batch 5122.74 | loss nan | bpc nan | |
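```

The halving progression above is Apex-style dynamic loss scaling: whenever any gradient contains an inf/NaN, the optimizer step is skipped and the scale is multiplied by a backoff factor (0.5 here, judging by the printed values), and the scale only grows again after a run of clean steps. Below is a minimal sketch of that mechanism, assuming common Apex-like defaults rather than this run's actual configuration; it is not the real `apex` API.

```python
import torch

class DynamicLossScaler:
    """Minimal sketch of Apex-style dynamic loss scaling (not the real apex API)."""

    def __init__(self, init_scale=2.0 ** 15, backoff=0.5, growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff            # factor applied on overflow (assumed 0.5)
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, optimizer, params):
        grads = [p.grad for p in params if p.grad is not None]
        if any(not torch.isfinite(g).all() for g in grads):
            self.scale *= self.backoff    # -> "reducing loss scale to ..."
            self._good_steps = 0
            return False                  # -> "Gradient overflow. Skipping step"
        for g in grads:
            g.div_(self.scale)            # unscale before the real update
        optimizer.step()
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= 2.0             # probe a larger scale after clean steps
        return True
```

Once every step overflows, the scale only ever shrinks and no parameter update is applied, which is exactly what the remainder of this log shows.

```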
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104170 | 584 batches | lr 0.000421 | ms/batch 5117.21 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48 | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48 | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48 | |
| epoch 152 step 104180 | 594 batches | lr 0.000421 | ms/batch 5121.18 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54 | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54 | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104190 | 604 batches | lr 0.000421 | ms/batch 5119.10 | loss nan | bpc nan | |
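```

By this point the reported training loss is already nan, so no amount of loss-scale backoff can recover the run; the `WARNING:root:NaN or Inf found in input tensor.` lines are most likely the scalar logger refusing the nan values (tensorboardX emits exactly this warning). A simple guard that fails fast instead of logging nan for hundreds of further steps is sketched below; `checkpoint_path` is a hypothetical argument, not something this script is known to take.

```python
import math
from typing import Optional

def check_loss(loss_value: float, step: int, checkpoint_path: Optional[str] = None):
    """Stop training as soon as the loss goes non-finite instead of looping on nan."""
    if math.isfinite(loss_value):
        return
    hint = f"; consider restoring {checkpoint_path} with a lower LR" if checkpoint_path else ""
    raise RuntimeError(f"non-finite loss {loss_value!r} at step {step}{hint}")
```

```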
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60 | |
| epoch 152 step 104200 | 614 batches | lr 0.000421 | ms/batch 5120.82 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104210 | 624 batches | lr 0.000421 | ms/batch 5120.74 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104220 | 634 batches | lr 0.000421 | ms/batch 5123.10 | loss nan | bpc nan | |
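```

Each scale value is printed four times, and two messages occasionally land fused on one line, which suggests four data-parallel workers each writing the same event to a shared, unsynchronized stream. Guarding prints behind a rank check keeps a multi-GPU log to one copy per event; a sketch, assuming `torch.distributed` is the process group in use:

```python
import torch.distributed as dist

def log_rank0(msg: str) -> None:
    """Print only from rank 0 of a torch.distributed job (or always, if not distributed)."""
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        print(msg, flush=True)  # flush=True also helps avoid interleaved half-lines
```

```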
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78 | |
| epoch 152 step 104230 | 644 batches | lr 0.000421 | ms/batch 5120.38 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104240 | 654 batches | lr 0.000421 | ms/batch 5122.52 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104250 | 664 batches | lr 0.000421 | ms/batch 5118.77 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| Eval 417 at step 104250 | time: 1355.27s | valid loss nan | bpc nan | |
---------------------------------------------------------------------------------------------------- | |
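```

The bpc column is the cross-entropy converted from nats to bits, bpc = loss / ln 2, so it goes nan the moment the loss does. A one-liner confirms the conversion against the healthy rows earlier in the run, where a loss of about 0.75 nats/char sits next to a bpc of about 1.08:

```python
import math

def to_bpc(nat_loss: float) -> float:
    """Convert a natural-log cross-entropy (nats/char) to bits per character."""
    return nat_loss / math.log(2)

assert abs(to_bpc(0.75) - 1.0820) < 1e-3  # 0.75 nats/char ~= 1.08 bpc
```

```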
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104260 | 674 batches | lr 0.000421 | ms/batch 11943.73 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 152 step 104270 | 684 batches | lr 0.000421 | ms/batch 5119.44 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104 | |
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104 | |
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384 | |
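```

The `dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384` lines show the corpus iterator discarding the final partial window of each stream: batches are `[seq_len, batch=2]` tensors, and the tail is shorter than 16384 steps. A sketch of that behaviour under assumed shapes; the actual data pipeline is not visible in this log.

```python
import torch

def full_windows(tokens, seq_len: int = 16384, batch_size: int = 2):
    """Yield [seq_len, batch_size] windows, dropping the short remainder."""
    usable = (len(tokens) // batch_size) * batch_size   # assumes a flat token list
    stream = torch.as_tensor(tokens[:usable]).view(batch_size, -1).t()  # [steps, batch]
    for start in range(0, stream.size(0), seq_len):
        window = stream[start:start + seq_len]
        if window.size(0) < seq_len:
            print(f"dropping batch of size {tuple(window.shape)} "
                  f"because not full seq_len of {seq_len}")
            break
        yield window
```

```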
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104280 | 8 batches | lr 0.000421 | ms/batch 5118.13 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104290 | 18 batches | lr 0.000421 | ms/batch 5118.66 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104300 | 28 batches | lr 0.000421 | ms/batch 5117.96 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104310 | 38 batches | lr 0.000421 | ms/batch 5120.30 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104320 | 48 batches | lr 0.000421 | ms/batch 5119.29 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104330 | 58 batches | lr 0.000421 | ms/batch 5117.85 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104340 | 68 batches | lr 0.000421 | ms/batch 5119.16 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104350 | 78 batches | lr 0.000421 | ms/batch 5117.98 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104360 | 88 batches | lr 0.000421 | ms/batch 5117.80 | loss nan | bpc nan | |
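# The "WARNING:root:NaN or Inf found in input tensor." pairs around each
# report line match the warning tensorboardX emits when a non-finite scalar
# is written to the event log, i.e. the nan loss/bpc values themselves are
# being logged. Since every step past this point is wasted, one hedged
# addition to the training loop would be a hard stop on the first non-finite
# loss (assert_finite and the step numbers here are illustrative, not names
# from the original script):

import torch

def assert_finite(loss, step):
    # Abort (or reload the last good checkpoint) as soon as the loss goes
    # non-finite, instead of logging "loss nan" for thousands more steps.
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")

assert_finite(torch.tensor(0.74), step=104290)       # passes silently
# assert_finite(torch.tensor(float("nan")), 104300)  # would raise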
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104370 | 98 batches | lr 0.000421 | ms/batch 5119.46 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104380 | 108 batches | lr 0.000421 | ms/batch 5121.43 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104390 | 118 batches | lr 0.000421 | ms/batch 5118.86 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104400 | 128 batches | lr 0.000421 | ms/batch 5118.50 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104410 | 138 batches | lr 0.000421 | ms/batch 5117.97 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104420 | 148 batches | lr 0.000421 | ms/batch 5118.21 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104430 | 158 batches | lr 0.000421 | ms/batch 5118.15 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205 | |
| epoch 153 step 104440 | 168 batches | lr 0.000421 | ms/batch 5117.66 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104450 | 178 batches | lr 0.00042 | ms/batch 5117.37 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104460 | 188 batches | lr 0.00042 | ms/batch 5118.34 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104470 | 198 batches | lr 0.00042 | ms/batch 5118.37 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104480 | 208 batches | lr 0.00042 | ms/batch 5119.34 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104490 | 218 batches | lr 0.00042 | ms/batch 5118.17 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104500 | 228 batches | lr 0.00042 | ms/batch 5120.18 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241 | |
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384 | |
---------------------------------------------------------------------------------------------------- | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| Eval 418 at step 104500 | time: 1347.99s | valid loss nan | bpc nan | |
---------------------------------------------------------------------------------------------------- | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104510 | 238 batches | lr 0.00042 | ms/batch 11941.77 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104520 | 248 batches | lr 0.00042 | ms/batch 5118.78 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259 | |
| epoch 153 step 104530 | 258 batches | lr 0.00042 | ms/batch 5118.20 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265 | |
| epoch 153 step 104540 | 268 batches | lr 0.00042 | ms/batch 5120.17 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104550 | 278 batches | lr 0.00042 | ms/batch 5118.63 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277 | |
| epoch 153 step 104560 | 288 batches | lr 0.00042 | ms/batch 5119.90 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104570 | 298 batches | lr 0.00042 | ms/batch 5118.48 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104580 | 308 batches | lr 0.00042 | ms/batch 5118.17 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104590 | 318 batches | lr 0.00042 | ms/batch 5117.51 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104600 | 328 batches | lr 0.00042 | ms/batch 5117.56 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307 | |
| epoch 153 step 104610 | 338 batches | lr 0.00042 | ms/batch 5118.73 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104620 | 348 batches | lr 0.00042 | ms/batch 5118.36 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319 | |
WARNING:root:NaN or Inf found in input tensor. | |
WARNING:root:NaN or Inf found in input tensor. | |
| epoch 153 step 104630 | 358 batches | lr 0.00042 | ms/batch 5118.32 | loss nan | bpc nan | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0 | |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0 | |
Traceback (most recent call last): | |
  File "train.py", line 679, in <module> | |
    train() | |
  File "train.py", line 558, in train | |
    scaled_loss.backward() | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__ | |
    next(self.gen) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss | |
    optimizer._post_amp_backward(loss_scaler) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights | |
    models_are_masters=False) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale | |
    1./scale) | |
ZeroDivisionError: float division by zero | |
Traceback (most recent call last): | |
  File "train.py", line 679, in <module> | |
    train() | |
  File "train.py", line 558, in train | |
    scaled_loss.backward() | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__ | |
    next(self.gen) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss | |
    optimizer._post_amp_backward(loss_scaler) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights | |
    models_are_masters=False) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale | |
    1./scale) | |
ZeroDivisionError: float division by zero | |
Traceback (most recent call last): | |
  File "train.py", line 679, in <module> | |
    train() | |
  File "train.py", line 558, in train | |
    scaled_loss.backward() | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__ | |
    next(self.gen) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss | |
    optimizer._post_amp_backward(loss_scaler) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights | |
    models_are_masters=False) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale | |
    1./scale) | |
ZeroDivisionError: float division by zero | |
Traceback (most recent call last): | |
  File "train.py", line 679, in <module> | |
    train() | |
  File "train.py", line 558, in train | |
    scaled_loss.backward() | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__ | |
    next(self.gen) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss | |
    optimizer._post_amp_backward(loss_scaler) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights | |
    models_are_masters=False) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale | |
    1./scale) | |
ZeroDivisionError: float division by zero | |
Traceback (most recent call last): | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/runpy.py", line 193, in _run_module_as_main | |
    "__main__", mod_spec) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/runpy.py", line 85, in _run_code | |
    exec(code, run_globals) | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/torch/distributed/launch.py", line 246, in <module> | |
    main() | |
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/torch/distributed/launch.py", line 242, in main | |
    cmd=cmd) | |
subprocess.CalledProcessError: Command '['/home/armanc/miniconda3/envs/transformers/bin/python', '-u', 'train.py', '--local_rank=3', '--cuda', '--data', '../data/text8/', '--dataset', 'text8', '--n_layer', '12', '--d_model', '512', '--n_head', '8', '--d_head', '64', '--d_inner', '2048', '--dropout', '0.1', '--dropatt', '0.0', '--optim', 'adam', '--lr', '0.00025', '--warmup_step', '0', '--max_step', '400000', '--tgt_len', '512', '--mem_len', '512', '--eval_tgt_len', '128', '--batch_size', '22', '--multi_gpu', '--gpu0_bsz', '-1', '--work_dir', '/net/nfs.corp/s2-research/lm-long/exp_10_18/seq16k_w512_bsz4_nodilation_lr0.0005_batch8', '--attention_mode', 'tvm', '--mem_len', '0', '--tgt_len', '16384', '--eval_tgt_len', '16384', '--attention_window', '512', '--dilation_increase', '0', '--lr', '0.0005', '--batch_size', '2', '--warmup', '2000', '--fp16', '--eval-interval', '250', '--batch_chunk', '2']' returned non-zero exit status 1. | |
``` |
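What this log captures is apex amp's dynamic loss scaler spiraling to zero. Training diverged around step 104570 (the `loss nan | bpc nan` rows), so every subsequent backward pass reports a gradient overflow, and the scaler shrinks the loss scale on every step instead of only occasionally. After roughly 1,090 halvings the FP64 scale underflows through the subnormal range (the `1e-323` lines) to exactly `0.0`, and the next unscale in `apex/amp/scaler.py` evaluates `1./scale`, raising the `ZeroDivisionError: float division by zero` that kills all four workers. The sketch below reproduces the mechanism; it is illustrative only, a toy rather than apex's actual implementation, and `ToyScaler` plus the fake NaN gradients are assumptions made for the demo.

```python
import math


class ToyScaler:
    """Toy dynamic loss scaler (NOT apex's class) reproducing the failure above."""

    def __init__(self, init_scale=2.0 ** 16):
        self.scale = init_scale  # dynamic scaling typically starts around 2**16

    def unscale_and_step(self, grads):
        # apex unscales gradients before checking them for inf/NaN
        # (apex/amp/scaler.py computes `1./scale`), so once the scale has
        # decayed to exactly 0.0 this line raises ZeroDivisionError,
        # matching the crash at the end of the log.
        inv = 1.0 / self.scale
        grads = [g * inv for g in grads]
        if any(math.isnan(g) or math.isinf(g) for g in grads):
            # The "Gradient overflow. Skipping step ..." lines: skip the
            # optimizer step and halve the scale. With a NaN loss this branch
            # is taken on every step, so the scale only ever shrinks, sliding
            # through the subnormal doubles and finally rounding to 0.0.
            self.scale *= 0.5
            print(f"Gradient overflow. Skipping step, reducing loss scale to {self.scale}")
            return False
        return True  # a real scaler would apply the optimizer step here


scaler = ToyScaler()
try:
    # ~1091 halvings take 2**16 below the smallest subnormal double (~5e-324).
    for _ in range(1200):
        scaler.unscale_and_step([float("nan")])  # a model already emitting NaNs
except ZeroDivisionError as exc:
    print("crashed like train.py:", exc)  # float division by zero
```

Skipping steps only postpones the failure once the loss itself is NaN, so the real fix is to restart from a checkpoint taken before the divergence, with a lower learning rate or tighter gradient clipping. As a guard against this specific crash, recent apex versions also accept a `min_loss_scale` argument to `amp.initialize(...)`, which puts a floor under the dynamic scale so it can never reach `0.0`.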