@zomux
Created April 1, 2021 04:04
=============
== PyTorch ==
=============
NVIDIA Release 19.10 (build 8472689)
PyTorch Version 1.3.0a0+24ae9b5
Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
| distributed init (rank 3): tcp://localhost:11813
| distributed init (rank 2): tcp://localhost:11813
| distributed init (rank 0): tcp://localhost:11813
| distributed init (rank 4): tcp://localhost:11813
| initialized host abfb25f6690a as rank 4
| distributed init (rank 1): tcp://localhost:11813
| initialized host abfb25f6690a as rank 1
| distributed init (rank 5): tcp://localhost:11813
| initialized host abfb25f6690a as rank 5
| distributed init (rank 6): tcp://localhost:11813
| distributed init (rank 7): tcp://localhost:11813
| initialized host abfb25f6690a as rank 6
| initialized host abfb25f6690a as rank 7
| initialized host abfb25f6690a as rank 3
| initialized host abfb25f6690a as rank 2
| initialized host abfb25f6690a as rank 0
Namespace(activation_dropout=0.0, activation_fn='relu', adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_wmt_en_de', attention_dropout=0.0, beam_size=10, best_checkpoint_metric='argmax_r', bleurt_scale=False, bucket_cap_mb=25, clip_norm=0.0, contrastive=False, cpu=False, criterion='reward_cross_entropy', curriculum=0, data='/data/CMLMTP_JAEN/train/aspec_jaen', dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, decoder_wise_training=False, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:11813', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, dropout=0.3, early_stop=True, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, focus=-1, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', load='/data/CMLMTP_JAEN/train/baseline_jaen/epoch72-81.pt', log_format=None, log_interval=1000, lr=[0.0001], lr_scheduler='fixed', lr_shrink=0.1, m=30.0, mask=False, masking=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=1024, max_tokens_valid=500, max_update=0, maximize_best_checkpoint_metric=True, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, momentum=0.0, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, 
no_save_optimizer_state=False, no_token_positional_embeddings=False, num_workers=0, optimizer='sgd', optimizer_overrides='{}', proxyloss2=True, raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', reward='sbleu', save_dir='checkpoints', save_interval=1, save_interval_updates=20, save_path='', seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=False, source_lang='ja', target_lang='en', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_decoder_only=True, train_subset='train', update_freq=[1], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_updates=0, weight_decay=0.0)
| [ja] dictionary: 37912 types
| [en] dictionary: 32440 types
| loaded 2000 examples from: /data/CMLMTP_JAEN/train/aspec_jaen/valid.ja-en.ja
| loaded 2000 examples from: /data/CMLMTP_JAEN/train/aspec_jaen/valid.ja-en.en
| /data/CMLMTP_JAEN/train/aspec_jaen valid ja-en 2000 examples
TransformerModel(
(encoder): TransformerEncoder(
(embed_tokens): Embedding(37912, 512, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(embed_tokens): Embedding(32440, 512, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([512]), eps=1e-05, elementwise_affine=True)
)
)
)
)
| model transformer_wmt_en_de, criterion RewardCrossEntropyCriterion
| num. model params: 80158720 (num. trained: 41833472)
loading model from /data/CMLMTP_JAEN/train/baseline_jaen/epoch72-81.pt
checkpoint loaded
checkpoint loaded
checkpoint loaded
checkpoint loaded
checkpoint loaded
checkpoint loaded
checkpoint loaded
checkpoint loaded
checkpoint loaded
checkpoint loaded
| training on 8 GPUs
| max tokens per GPU = 1024 and max sentences per GPU = None
| no existing checkpoint found checkpoints/checkpoint_last.pt
| loading train data for epoch 0
| loaded 2188816 examples from: /data/CMLMTP_JAEN/train/aspec_jaen/train.ja-en.ja
| loaded 2188816 examples from: /data/CMLMTP_JAEN/train/aspec_jaen/train.ja-en.en
| /data/CMLMTP_JAEN/train/aspec_jaen train ja-en 2188816 examples
| set valid tokens to 305
| epoch 000 | valid on 'valid' subset | loss 4.938 | nll_loss -0.006 | ppl 1.00 | num_updates 0 | best_r 40.0904 | argmax_r 32.6337 | avg_r 31.6776
| epoch 001 | valid on 'valid' subset | loss 4.833 | nll_loss -0.005 | ppl 1.00 | num_updates 4220 | best_r 40.7281 | argmax_r 33.6096 | avg_r 32.4525
saving checkpoint ...
| epoch 001 | valid on 'valid' subset | loss 4.826 | nll_loss -0.004 | ppl 1.00 | num_updates 4240 | best_loss 4.82563 | best_r 40.6959 | argmax_r 33.5368 | avg_r 32.429
| epoch 001 | valid on 'valid' subset | loss 4.810 | nll_loss -0.004 | ppl 1.00 | num_updates 4260 | best_loss 4.80978 | best_r 40.6579 | argmax_r 33.4991 | avg_r 32.4141
| epoch 001 | valid on 'valid' subset | loss 4.831 | nll_loss -0.004 | ppl 1.00 | num_updates 4280 | best_loss 4.83111 | best_r 40.7304 | argmax_r 33.6052 | avg_r 32.4628
| epoch 001 | valid on 'valid' subset | loss 4.825 | nll_loss -0.005 | ppl 1.00 | num_updates 4300 | best_loss 4.8253 | best_r 40.6661 | argmax_r 33.5619 | avg_r 32.4189
| epoch 001 | valid on 'valid' subset | loss 4.820 | nll_loss -0.005 | ppl 1.00 | num_updates 4320 | best_loss 4.82037 | best_r 40.6813 | argmax_r 33.4498 | avg_r 32.4188
| epoch 001 | valid on 'valid' subset | loss 4.830 | nll_loss -0.005 | ppl 1.00 | num_updates 4340 | best_loss 4.82951 | best_r 40.7559 | argmax_r 33.4846 | avg_r 32.4386
| epoch 001 | valid on 'valid' subset | loss 4.839 | nll_loss -0.005 | ppl 1.00 | num_updates 4360 | best_loss 4.83905 | best_r 40.7116 | argmax_r 33.4722 | avg_r 32.3937
| epoch 001 | valid on 'valid' subset | loss 4.842 | nll_loss -0.005 | ppl 1.00 | num_updates 4380 | best_loss 4.84157 | best_r 40.7409 | argmax_r 33.5761 | avg_r 32.4335
| epoch 001 | valid on 'valid' subset | loss 4.850 | nll_loss -0.005 | ppl 1.00 | num_updates 4400 | best_loss 4.85025 | best_r 40.7801 | argmax_r 33.554 | avg_r 32.4251
| epoch 001 | valid on 'valid' subset | loss 4.831 | nll_loss -0.005 | ppl 1.00 | num_updates 4420 | best_loss 4.83093 | best_r 40.7525 | argmax_r 33.5771 | avg_r 32.4321
| epoch 001 | valid on 'valid' subset | loss 4.850 | nll_loss -0.005 | ppl 1.00 | num_updates 4440 | best_loss 4.85028 | best_r 40.7947 | argmax_r 33.4874 | avg_r 32.4202
| epoch 001 | valid on 'valid' subset | loss 4.839 | nll_loss -0.005 | ppl 1.00 | num_updates 4460 | best_loss 4.83865 | best_r 40.7644 | argmax_r 33.4734 | avg_r 32.4341
| epoch 001 | valid on 'valid' subset | loss 4.841 | nll_loss -0.005 | ppl 1.00 | num_updates 4480 | best_loss 4.8407 | best_r 40.7779 | argmax_r 33.4371 | avg_r 32.4359
| epoch 001 | valid on 'valid' subset | loss 4.834 | nll_loss -0.005 | ppl 1.00 | num_updates 4500 | best_loss 4.83447 | best_r 40.7375 | argmax_r 33.5064 | avg_r 32.4108
| epoch 001 | valid on 'valid' subset | loss 4.820 | nll_loss -0.005 | ppl 1.00 | num_updates 4520 | best_loss 4.8202 | best_r 40.7192 | argmax_r 33.515 | avg_r 32.4252
| epoch 001 | valid on 'valid' subset | loss 4.830 | nll_loss -0.005 | ppl 1.00 | num_updates 4540 | best_loss 4.83017 | best_r 40.7456 | argmax_r 33.5407 | avg_r 32.4238
| epoch 001 | valid on 'valid' subset | loss 4.843 | nll_loss -0.004 | ppl 1.00 | num_updates 4560 | best_loss 4.84287 | best_r 40.7946 | argmax_r 33.5867 | avg_r 32.4409
| epoch 001 | valid on 'valid' subset | loss 4.826 | nll_loss -0.004 | ppl 1.00 | num_updates 4580 | best_loss 4.82639 | best_r 40.7444 | argmax_r 33.4991 | avg_r 32.4352
| epoch 001 | valid on 'valid' subset | loss 4.809 | nll_loss -0.004 | ppl 1.00 | num_updates 4600 | best_loss 4.80854 | best_r 40.6692 | argmax_r 33.5111 | avg_r 32.4276
| epoch 001 | valid on 'valid' subset | loss 4.802 | nll_loss -0.005 | ppl 1.00 | num_updates 4620 | best_loss 4.80233 | best_r 40.6477 | argmax_r 33.4774 | avg_r 32.3863
| epoch 001 | valid on 'valid' subset | loss 4.821 | nll_loss -0.005 | ppl 1.00 | num_updates 4640 | best_loss 4.82059 | best_r 40.7154 | argmax_r 33.51 | avg_r 32.4117
| epoch 001 | valid on 'valid' subset | loss 4.818 | nll_loss -0.005 | ppl 1.00 | num_updates 4660 | best_loss 4.81803 | best_r 40.7058 | argmax_r 33.4692 | avg_r 32.3981
| epoch 001 | valid on 'valid' subset | loss 4.813 | nll_loss -0.005 | ppl 1.00 | num_updates 4680 | best_loss 4.81277 | best_r 40.7047 | argmax_r 33.4835 | avg_r 32.4083
| epoch 001 | valid on 'valid' subset | loss 4.817 | nll_loss -0.005 | ppl 1.00 | num_updates 4700 | best_loss 4.81703 | best_r 40.7344 | argmax_r 33.4622 | avg_r 32.4079
| epoch 001 | valid on 'valid' subset | loss 4.823 | nll_loss -0.005 | ppl 1.00 | num_updates 4720 | best_loss 4.82265 | best_r 40.7469 | argmax_r 33.4826 | avg_r 32.4112
| epoch 001 | valid on 'valid' subset | loss 4.856 | nll_loss -0.005 | ppl 1.00 | num_updates 4740 | best_loss 4.85554 | best_r 40.7927 | argmax_r 33.5364 | avg_r 32.4113
| epoch 001 | valid on 'valid' subset | loss 4.831 | nll_loss -0.005 | ppl 1.00 | num_updates 4760 | best_loss 4.83119 | best_r 40.7412 | argmax_r 33.4418 | avg_r 32.4005
| epoch 001 | valid on 'valid' subset | loss 4.861 | nll_loss -0.005 | ppl 1.00 | num_updates 4780 | best_loss 4.86091 | best_r 40.7967 | argmax_r 33.4535 | avg_r 32.4353
| epoch 001 | valid on 'valid' subset | loss 4.845 | nll_loss -0.005 | ppl 1.00 | num_updates 4800 | best_loss 4.84495 | best_r 40.7403 | argmax_r 33.4732 | avg_r 32.4023
| epoch 001 | valid on 'valid' subset | loss 4.874 | nll_loss -0.005 | ppl 1.00 | num_updates 4820 | best_loss 4.87359 | best_r 40.8096 | argmax_r 33.5293 | avg_r 32.4127
| epoch 001 | valid on 'valid' subset | loss 4.859 | nll_loss -0.005 | ppl 1.00 | num_updates 4840 | best_loss 4.85852 | best_r 40.7833 | argmax_r 33.5031 | avg_r 32.4063
| epoch 001 | valid on 'valid' subset | loss 4.841 | nll_loss -0.005 | ppl 1.00 | num_updates 4860 | best_loss 4.84147 | best_r 40.7513 | argmax_r 33.5148 | avg_r 32.3947
| epoch 001 | valid on 'valid' subset | loss 4.849 | nll_loss -0.005 | ppl 1.00 | num_updates 4880 | best_loss 4.84891 | best_r 40.7914 | argmax_r 33.5138 | avg_r 32.4108
| epoch 001 | valid on 'valid' subset | loss 4.848 | nll_loss -0.005 | ppl 1.00 | num_updates 4900 | best_loss 4.84754 | best_r 40.7675 | argmax_r 33.5151 | avg_r 32.4302
| epoch 001 | valid on 'valid' subset | loss 4.850 | nll_loss -0.005 | ppl 1.00 | num_updates 4920 | best_loss 4.84984 | best_r 40.7878 | argmax_r 33.5726 | avg_r 32.4617
| epoch 001 | valid on 'valid' subset | loss 4.859 | nll_loss -0.005 | ppl 1.00 | num_updates 4940 | best_loss 4.85873 | best_r 40.7539 | argmax_r 33.5288 | avg_r 32.4006
| epoch 001 | valid on 'valid' subset | loss 4.833 | nll_loss -0.005 | ppl 1.00 | num_updates 4960 | best_loss 4.83271 | best_r 40.6804 | argmax_r 33.5517 | avg_r 32.4193
| epoch 001 | valid on 'valid' subset | loss 4.789 | nll_loss -0.005 | ppl 1.00 | num_updates 4980 | best_loss 4.78948 | best_r 40.622 | argmax_r 33.5782 | avg_r 32.4038
| epoch 001 | valid on 'valid' subset | loss 4.824 | nll_loss -0.004 | ppl 1.00 | num_updates 5000 | best_loss 4.8236 | best_r 40.7503 | argmax_r 33.5119 | avg_r 32.4442
| epoch 001: 5000 / 11745 loss=5.437, nll_loss=-0.008, ppl=0.99, wps=1987, ups=0, wpb=5428.263, bsz=186.486, num_updates=5001, lr=0.0001, gnorm=7.710, clip=0.000, oom=0.000, wall=13730, train_wall=7091, best_r=45.6077, argmax_r=37.3044, avg_r=36.0153
| epoch 001 | valid on 'valid' subset | loss 4.803 | nll_loss -0.004 | ppl 1.00 | num_updates 5020 | best_loss 4.80312 | best_r 40.6994 | argmax_r 33.588 | avg_r 32.4564
| epoch 001 | valid on 'valid' subset | loss 4.802 | nll_loss -0.005 | ppl 1.00 | num_updates 5040 | best_loss 4.80249 | best_r 40.6695 | argmax_r 33.5234 | avg_r 32.4429
| epoch 001 | valid on 'valid' subset | loss 4.782 | nll_loss -0.005 | ppl 1.00 | num_updates 5060 | best_loss 4.78177 | best_r 40.569 | argmax_r 33.5169 | avg_r 32.3851
| epoch 001 | valid on 'valid' subset | loss 4.800 | nll_loss -0.005 | ppl 1.00 | num_updates 5080 | best_loss 4.80045 | best_r 40.648 | argmax_r 33.5445 | avg_r 32.4193
| epoch 001 | valid on 'valid' subset | loss 4.808 | nll_loss -0.005 | ppl 1.00 | num_updates 5100 | best_loss 4.80786 | best_r 40.6161 | argmax_r 33.5065 | avg_r 32.3662
| epoch 001 | valid on 'valid' subset | loss 4.779 | nll_loss -0.005 | ppl 1.00 | num_updates 5120 | best_loss 4.77858 | best_r 40.6065 | argmax_r 33.5249 | avg_r 32.3951
| epoch 001 | valid on 'valid' subset | loss 4.775 | nll_loss -0.005 | ppl 1.00 | num_updates 5140 | best_loss 4.77459 | best_r 40.5257 | argmax_r 33.4622 | avg_r 32.353
| epoch 001 | valid on 'valid' subset | loss 4.758 | nll_loss -0.004 | ppl 1.00 | num_updates 5160 | best_loss 4.75773 | best_r 40.5015 | argmax_r 33.5156 | avg_r 32.3799
| epoch 001 | valid on 'valid' subset | loss 4.811 | nll_loss -0.005 | ppl 1.00 | num_updates 5180 | best_loss 4.81091 | best_r 40.7115 | argmax_r 33.4916 | avg_r 32.4289
| epoch 001 | valid on 'valid' subset | loss 4.802 | nll_loss -0.005 | ppl 1.00 | num_updates 5200 | best_loss 4.80218 | best_r 40.5906 | argmax_r 33.4486 | avg_r 32.3509
| epoch 001 | valid on 'valid' subset | loss 4.801 | nll_loss -0.005 | ppl 1.00 | num_updates 5220 | best_loss 4.80094 | best_r 40.5927 | argmax_r 33.4342 | avg_r 32.3993
| epoch 001 | valid on 'valid' subset | loss 4.793 | nll_loss -0.005 | ppl 1.00 | num_updates 5240 | best_loss 4.79317 | best_r 40.5405 | argmax_r 33.3991 | avg_r 32.3642
| epoch 001 | valid on 'valid' subset | loss 4.813 | nll_loss -0.005 | ppl 1.00 | num_updates 5260 | best_loss 4.81273 | best_r 40.5973 | argmax_r 33.3819 | avg_r 32.3713
| epoch 001 | valid on 'valid' subset | loss 4.800 | nll_loss -0.005 | ppl 1.00 | num_updates 5280 | best_loss 4.80028 | best_r 40.4862 | argmax_r 33.3892 | avg_r 32.3375
| epoch 001 | valid on 'valid' subset | loss 4.791 | nll_loss -0.005 | ppl 1.00 | num_updates 5300 | best_loss 4.7911 | best_r 40.5205 | argmax_r 33.404 | avg_r 32.3625
| epoch 001 | valid on 'valid' subset | loss 4.829 | nll_loss -0.005 | ppl 1.00 | num_updates 5320 | best_loss 4.82941 | best_r 40.6471 | argmax_r 33.4486 | avg_r 32.3898
| epoch 001 | valid on 'valid' subset | loss 4.816 | nll_loss -0.004 | ppl 1.00 | num_updates 5340 | best_loss 4.8163 | best_r 40.5625 | argmax_r 33.3583 | avg_r 32.3591
| epoch 001 | valid on 'valid' subset | loss 4.824 | nll_loss -0.004 | ppl 1.00 | num_updates 5360 | best_loss 4.82421 | best_r 40.6097 | argmax_r 33.4312 | avg_r 32.3836
| epoch 001 | valid on 'valid' subset | loss 4.818 | nll_loss -0.004 | ppl 1.00 | num_updates 5380 | best_loss 4.81846 | best_r 40.6242 | argmax_r 33.4585 | avg_r 32.3833
| epoch 001 | valid on 'valid' subset | loss 4.798 | nll_loss -0.005 | ppl 1.00 | num_updates 5400 | best_loss 4.79765 | best_r 40.567 | argmax_r 33.4219 | avg_r 32.3815
| epoch 001 | valid on 'valid' subset | loss 4.822 | nll_loss -0.004 | ppl 1.00 | num_updates 5420 | best_loss 4.82175 | best_r 40.599 | argmax_r 33.362 | avg_r 32.3536
| epoch 001 | valid on 'valid' subset | loss 4.794 | nll_loss -0.005 | ppl 1.00 | num_updates 5440 | best_loss 4.79434 | best_r 40.5451 | argmax_r 33.3803 | avg_r 32.3588
| epoch 001 | valid on 'valid' subset | loss 4.816 | nll_loss -0.005 | ppl 1.00 | num_updates 5460 | best_loss 4.81589 | best_r 40.5969 | argmax_r 33.3855 | avg_r 32.3603
| epoch 001 | valid on 'valid' subset | loss 4.786 | nll_loss -0.005 | ppl 1.00 | num_updates 5480 | best_loss 4.78559 | best_r 40.4625 | argmax_r 33.3201 | avg_r 32.3499
| epoch 001 | valid on 'valid' subset | loss 4.822 | nll_loss -0.004 | ppl 1.00 | num_updates 5500 | best_loss 4.82189 | best_r 40.6085 | argmax_r 33.3766 | avg_r 32.3507
| epoch 001 | valid on 'valid' subset | loss 4.850 | nll_loss -0.005 | ppl 1.00 | num_updates 5520 | best_loss 4.85028 | best_r 40.698 | argmax_r 33.3645 | avg_r 32.3834
| epoch 001 | valid on 'valid' subset | loss 4.827 | nll_loss -0.005 | ppl 1.00 | num_updates 5540 | best_loss 4.82713 | best_r 40.643 | argmax_r 33.3434 | avg_r 32.3665
| epoch 001 | valid on 'valid' subset | loss 4.824 | nll_loss -0.005 | ppl 1.00 | num_updates 5560 | best_loss 4.82392 | best_r 40.6509 | argmax_r 33.3901 | avg_r 32.3704
| epoch 001 | valid on 'valid' subset | loss 4.810 | nll_loss -0.005 | ppl 1.00 | num_updates 5580 | best_loss 4.8101 | best_r 40.6334 | argmax_r 33.4085 | avg_r 32.37
| epoch 001 | valid on 'valid' subset | loss 4.839 | nll_loss -0.005 | ppl 1.00 | num_updates 5600 | best_loss 4.83895 | best_r 40.6981 | argmax_r 33.3935 | avg_r 32.3832
| epoch 001 | valid on 'valid' subset | loss 4.816 | nll_loss -0.005 | ppl 1.00 | num_updates 5620 | best_loss 4.81647 | best_r 40.6793 | argmax_r 33.3626 | avg_r 32.3866
| epoch 001 | valid on 'valid' subset | loss 4.830 | nll_loss -0.005 | ppl 1.00 | num_updates 5640 | best_loss 4.82952 | best_r 40.7024 | argmax_r 33.3724 | avg_r 32.4024
| epoch 001 | valid on 'valid' subset | loss 4.811 | nll_loss -0.005 | ppl 1.00 | num_updates 5660 | best_loss 4.81116 | best_r 40.6063 | argmax_r 33.3825 | avg_r 32.3911
| epoch 001 | valid on 'valid' subset | loss 4.814 | nll_loss -0.005 | ppl 1.00 | num_updates 5680 | best_loss 4.81428 | best_r 40.572 | argmax_r 33.3363 | avg_r 32.375
| epoch 001 | valid on 'valid' subset | loss 4.832 | nll_loss -0.005 | ppl 1.00 | num_updates 5700 | best_loss 4.83219 | best_r 40.6964 | argmax_r 33.4186 | avg_r 32.3803
| epoch 001 | valid on 'valid' subset | loss 4.801 | nll_loss -0.005 | ppl 1.00 | num_updates 5720 | best_loss 4.80124 | best_r 40.579 | argmax_r 33.4426 | avg_r 32.3595
| epoch 001 | valid on 'valid' subset | loss 4.799 | nll_loss -0.005 | ppl 1.00 | num_updates 5740 | best_loss 4.79918 | best_r 40.6125 | argmax_r 33.4549 | avg_r 32.3827
| epoch 001 | valid on 'valid' subset | loss 4.801 | nll_loss -0.005 | ppl 1.00 | num_updates 5760 | best_loss 4.80144 | best_r 40.6034 | argmax_r 33.3938 | avg_r 32.3667
| epoch 001 | valid on 'valid' subset | loss 4.833 | nll_loss -0.004 | ppl 1.00 | num_updates 5780 | best_loss 4.8333 | best_r 40.7161 | argmax_r 33.4394 | avg_r 32.3967
| epoch 001 | valid on 'valid' subset | loss 4.820 | nll_loss -0.005 | ppl 1.00 | num_updates 5800 | best_loss 4.81971 | best_r 40.6422 | argmax_r 33.478 | avg_r 32.411
| epoch 001 | valid on 'valid' subset | loss 4.796 | nll_loss -0.005 | ppl 1.00 | num_updates 5820 | best_loss 4.7964 | best_r 40.6064 | argmax_r 33.4833 | avg_r 32.4216
| epoch 001 | valid on 'valid' subset | loss 4.823 | nll_loss -0.005 | ppl 1.00 | num_updates 5840 | best_loss 4.82336 | best_r 40.6684 | argmax_r 33.4882 | avg_r 32.4155
| epoch 001 | valid on 'valid' subset | loss 4.862 | nll_loss -0.004 | ppl 1.00 | num_updates 5860 | best_loss 4.86202 | best_r 40.7904 | argmax_r 33.4688 | avg_r 32.4375
| epoch 001 | valid on 'valid' subset | loss 4.801 | nll_loss -0.004 | ppl 1.00 | num_updates 5880 | best_loss 4.8014 | best_r 40.6211 | argmax_r 33.4592 | avg_r 32.3909
| epoch 001 | valid on 'valid' subset | loss 4.820 | nll_loss -0.004 | ppl 1.00 | num_updates 5900 | best_loss 4.82003 | best_r 40.647 | argmax_r 33.387 | avg_r 32.417
| epoch 001 | valid on 'valid' subset | loss 4.826 | nll_loss -0.004 | ppl 1.00 | num_updates 5920 | best_loss 4.82614 | best_r 40.651 | argmax_r 33.427 | avg_r 32.4206
| epoch 001 | valid on 'valid' subset | loss 4.810 | nll_loss -0.005 | ppl 1.00 | num_updates 5940 | best_loss 4.80952 | best_r 40.5906 | argmax_r 33.3826 | avg_r 32.4061
| epoch 001 | valid on 'valid' subset | loss 4.798 | nll_loss -0.005 | ppl 1.00 | num_updates 5960 | best_loss 4.79813 | best_r 40.554 | argmax_r 33.3758 | avg_r 32.3713
| epoch 001 | valid on 'valid' subset | loss 4.802 | nll_loss -0.005 | ppl 1.00 | num_updates 5980 | best_loss 4.80183 | best_r 40.5631 | argmax_r 33.3109 | avg_r 32.3829
| epoch 001 | valid on 'valid' subset | loss 4.812 | nll_loss -0.005 | ppl 1.00 | num_updates 6000 | best_loss 4.81227 | best_r 40.5915 | argmax_r 33.3953 | avg_r 32.377
| epoch 001: 6000 / 11745 loss=5.424, nll_loss=-0.008, ppl=0.99, wps=1986, ups=0, wpb=5428.903, bsz=186.520, num_updates=6001, lr=0.0001, gnorm=7.694, clip=0.000, oom=0.000, wall=16470, train_wall=8507, best_r=45.6136, argmax_r=37.3314, avg_r=36.0432
| epoch 001 | valid on 'valid' subset | loss 4.802 | nll_loss -0.005 | ppl 1.00 | num_updates 6020 | best_loss 4.80182 | best_r 40.5782 | argmax_r 33.3913 | avg_r 32.3779
| epoch 001 | valid on 'valid' subset | loss 4.793 | nll_loss -0.005 | ppl 1.00 | num_updates 6040 | best_loss 4.79331 | best_r 40.5238 | argmax_r 33.3454 | avg_r 32.3657
| epoch 001 | valid on 'valid' subset | loss 4.839 | nll_loss -0.004 | ppl 1.00 | num_updates 6060 | best_loss 4.83935 | best_r 40.6896 | argmax_r 33.3541 | avg_r 32.3876
| epoch 001 | valid on 'valid' subset | loss 4.818 | nll_loss -0.005 | ppl 1.00 | num_updates 6080 | best_loss 4.81817 | best_r 40.641 | argmax_r 33.3335 | avg_r 32.3739
| epoch 001 | valid on 'valid' subset | loss 4.803 | nll_loss -0.005 | ppl 1.00 | num_updates 6100 | best_loss 4.80297 | best_r 40.591 | argmax_r 33.3896 | avg_r 32.3927
| epoch 001 | valid on 'valid' subset | loss 4.799 | nll_loss -0.005 | ppl 1.00 | num_updates 6120 | best_loss 4.79933 | best_r 40.5156 | argmax_r 33.3499 | avg_r 32.3303
| epoch 001 | valid on 'valid' subset | loss 4.804 | nll_loss -0.005 | ppl 1.00 | num_updates 6140 | best_loss 4.80426 | best_r 40.5896 | argmax_r 33.4101 | avg_r 32.3848
| epoch 001 | valid on 'valid' subset | loss 4.781 | nll_loss -0.005 | ppl 1.00 | num_updates 6160 | best_loss 4.78145 | best_r 40.5283 | argmax_r 33.3543 | avg_r 32.3747
| epoch 001 | valid on 'valid' subset | loss 4.801 | nll_loss -0.005 | ppl 1.00 | num_updates 6180 | best_loss 4.80065 | best_r 40.5964 | argmax_r 33.3994 | avg_r 32.3709
| epoch 001 | valid on 'valid' subset | loss 4.816 | nll_loss -0.004 | ppl 1.00 | num_updates 6200 | best_loss 4.8158 | best_r 40.6015 | argmax_r 33.3159 | avg_r 32.3638
| epoch 001 | valid on 'valid' subset | loss 4.818 | nll_loss -0.004 | ppl 1.00 | num_updates 6220 | best_loss 4.81848 | best_r 40.6362 | argmax_r 33.2858 | avg_r 32.3581
| epoch 001 | valid on 'valid' subset | loss 4.815 | nll_loss -0.005 | ppl 1.00 | num_updates 6240 | best_loss 4.81498 | best_r 40.6449 | argmax_r 33.348 | avg_r 32.3751
| epoch 001 | valid on 'valid' subset | loss 4.806 | nll_loss -0.005 | ppl 1.00 | num_updates 6260 | best_loss 4.80625 | best_r 40.6411 | argmax_r 33.3622 | avg_r 32.3826
| epoch 001 | valid on 'valid' subset | loss 4.802 | nll_loss -0.005 | ppl 1.00 | num_updates 6280 | best_loss 4.80233 | best_r 40.574 | argmax_r 33.4305 | avg_r 32.3658
| epoch 001 | valid on 'valid' subset | loss 4.832 | nll_loss -0.005 | ppl 1.00 | num_updates 6300 | best_loss 4.83246 | best_r 40.6468 | argmax_r 33.3849 | avg_r 32.3796
| epoch 001 | valid on 'valid' subset | loss 4.829 | nll_loss -0.005 | ppl 1.00 | num_updates 6320 | best_loss 4.82865 | best_r 40.637 | argmax_r 33.373 | avg_r 32.3701
| epoch 001 | valid on 'valid' subset | loss 4.822 | nll_loss -0.005 | ppl 1.00 | num_updates 6340 | best_loss 4.82152 | best_r 40.6558 | argmax_r 33.3819 | avg_r 32.3735
| epoch 001 | valid on 'valid' subset | loss 4.809 | nll_loss -0.005 | ppl 1.00 | num_updates 6360 | best_loss 4.80875 | best_r 40.6434 | argmax_r 33.3675 | avg_r 32.3604
| epoch 001 | valid on 'valid' subset | loss 4.816 | nll_loss -0.005 | ppl 1.00 | num_updates 6380 | best_loss 4.81608 | best_r 40.6941 | argmax_r 33.4451 | avg_r 32.3944
| epoch 001 | valid on 'valid' subset | loss 4.811 | nll_loss -0.005 | ppl 1.00 | num_updates 6400 | best_loss 4.81114 | best_r 40.6802 | argmax_r 33.4501 | avg_r 32.3955
| epoch 001 | valid on 'valid' subset | loss 4.792 | nll_loss -0.005 | ppl 1.00 | num_updates 6420 | best_loss 4.79219 | best_r 40.6714 | argmax_r 33.4696 | avg_r 32.4014
| epoch 001 | valid on 'valid' subset | loss 4.784 | nll_loss -0.005 | ppl 1.00 | num_updates 6440 | best_loss 4.78441 | best_r 40.6051 | argmax_r 33.4705 | avg_r 32.3638
| epoch 001 | valid on 'valid' subset | loss 4.786 | nll_loss -0.005 | ppl 1.00 | num_updates 6460 | best_loss 4.78562 | best_r 40.5453 | argmax_r 33.4697 | avg_r 32.3473
| epoch 001 | valid on 'valid' subset | loss 4.771 | nll_loss -0.005 | ppl 1.00 | num_updates 6480 | best_loss 4.77136 | best_r 40.5229 | argmax_r 33.46 | avg_r 32.3669
| epoch 001 | valid on 'valid' subset | loss 4.777 | nll_loss -0.005 | ppl 1.00 | num_updates 6500 | best_loss 4.7773 | best_r 40.4912 | argmax_r 33.404 | avg_r 32.3365
| epoch 001 | valid on 'valid' subset | loss 4.811 | nll_loss -0.005 | ppl 1.00 | num_updates 6520 | best_loss 4.81068 | best_r 40.6471 | argmax_r 33.4461 | avg_r 32.3571
| epoch 001 | valid on 'valid' subset | loss 4.792 | nll_loss -0.005 | ppl 1.00 | num_updates 6540 | best_loss 4.79208 | best_r 40.6065 | argmax_r 33.4101 | avg_r 32.3603
| epoch 001 | valid on 'valid' subset | loss 4.796 | nll_loss -0.004 | ppl 1.00 | num_updates 6560 | best_loss 4.79647 | best_r 40.5878 | argmax_r 33.3673 | avg_r 32.3407
| epoch 001 | valid on 'valid' subset | loss 4.785 | nll_loss -0.005 | ppl 1.00 | num_updates 6580 | best_loss 4.78495 | best_r 40.5563 | argmax_r 33.3576 | avg_r 32.3181
| epoch 001 | valid on 'valid' subset | loss 4.801 | nll_loss -0.005 | ppl 1.00 | num_updates 6600 | best_loss 4.80101 | best_r 40.632 | argmax_r 33.4392 | avg_r 32.3621
| epoch 001 | valid on 'valid' subset | loss 4.785 | nll_loss -0.005 | ppl 1.00 | num_updates 6620 | best_loss 4.78539 | best_r 40.5896 | argmax_r 33.4113 | avg_r 32.3383
| epoch 001 | valid on 'valid' subset | loss 4.776 | nll_loss -0.005 | ppl 1.00 | num_updates 6640 | best_loss 4.776 | best_r 40.5578 | argmax_r 33.4056 | avg_r 32.3653
| epoch 001 | valid on 'valid' subset | loss 4.754 | nll_loss -0.005 | ppl 1.00 | num_updates 6660 | best_loss 4.75374 | best_r 40.515 | argmax_r 33.3809 | avg_r 32.3403
| epoch 001 | valid on 'valid' subset | loss 4.758 | nll_loss -0.005 | ppl 1.00 | num_updates 6680 | best_loss 4.75789 | best_r 40.5609 | argmax_r 33.417 | avg_r 32.375
| epoch 001 | valid on 'valid' subset | loss 4.777 | nll_loss -0.005 | ppl 1.00 | num_updates 6700 | best_loss 4.77729 | best_r 40.5595 | argmax_r 33.4266 | avg_r 32.3801
| epoch 001 | valid on 'valid' subset | loss 4.786 | nll_loss -0.005 | ppl 1.00 | num_updates 6720 | best_loss 4.78556 | best_r 40.6257 | argmax_r 33.4183 | avg_r 32.4058
| epoch 001 | valid on 'valid' subset | loss 4.783 | nll_loss -0.005 | ppl 1.00 | num_updates 6740 | best_loss 4.78252 | best_r 40.5779 | argmax_r 33.4443 | avg_r 32.3558
| epoch 001 | valid on 'valid' subset | loss 4.809 | nll_loss -0.004 | ppl 1.00 | num_updates 6760 | best_loss 4.80897 | best_r 40.6099 | argmax_r 33.4536 | avg_r 32.3575
| epoch 001 | valid on 'valid' subset | loss 4.768 | nll_loss -0.005 | ppl 1.00 | num_updates 6780 | best_loss 4.76818 | best_r 40.5463 | argmax_r 33.4245 | avg_r 32.3583
| epoch 001 | valid on 'valid' subset | loss 4.791 | nll_loss -0.005 | ppl 1.00 | num_updates 6800 | best_loss 4.79144 | best_r 40.5923 | argmax_r 33.4441 | avg_r 32.3705
| epoch 001 | valid on 'valid' subset | loss 4.788 | nll_loss -0.004 | ppl 1.00 | num_updates 6820 | best_loss 4.78819 | best_r 40.5818 | argmax_r 33.4013 | avg_r 32.3907
| epoch 001 | valid on 'valid' subset | loss 4.787 | nll_loss -0.004 | ppl 1.00 | num_updates 6840 | best_loss 4.78671 | best_r 40.5994 | argmax_r 33.4618 | avg_r 32.3971
| epoch 001 | valid on 'valid' subset | loss 4.786 | nll_loss -0.005 | ppl 1.00 | num_updates 6860 | best_loss 4.78564 | best_r 40.6277 | argmax_r 33.4476 | avg_r 32.422
| epoch 001 | valid on 'valid' subset | loss 4.808 | nll_loss -0.005 | ppl 1.00 | num_updates 6880 | best_loss 4.80799 | best_r 40.6863 | argmax_r 33.4844 | avg_r 32.4368
| epoch 001 | valid on 'valid' subset | loss 4.807 | nll_loss -0.004 | ppl 1.00 | num_updates 6900 | best_loss 4.8066 | best_r 40.6478 | argmax_r 33.4502 | avg_r 32.435
| epoch 001 | valid on 'valid' subset | loss 4.790 | nll_loss -0.005 | ppl 1.00 | num_updates 6920 | best_loss 4.78986 | best_r 40.6356 | argmax_r 33.4687 | avg_r 32.4259
| epoch 001 | valid on 'valid' subset | loss 4.784 | nll_loss -0.005 | ppl 1.00 | num_updates 6940 | best_loss 4.78364 | best_r 40.5846 | argmax_r 33.4294 | avg_r 32.4003
| epoch 001 | valid on 'valid' subset | loss 4.794 | nll_loss -0.005 | ppl 1.00 | num_updates 6960 | best_loss 4.794 | best_r 40.6545 | argmax_r 33.4215 | avg_r 32.4088
| epoch 001 | valid on 'valid' subset | loss 4.787 | nll_loss -0.004 | ppl 1.00 | num_updates 6980 | best_loss 4.78725 | best_r 40.6432 | argmax_r 33.386 | avg_r 32.4111
| epoch 001 | valid on 'valid' subset | loss 4.785 | nll_loss -0.004 | ppl 1.00 | num_updates 7000 | best_loss 4.78471 | best_r 40.6223 | argmax_r 33.3942 | avg_r 32.4005
| epoch 001: 7000 / 11745 loss=5.412, nll_loss=-0.008, ppl=0.99, wps=1985, ups=0, wpb=5428.760, bsz=186.432, num_updates=7001, lr=0.0001, gnorm=7.684, clip=0.000, oom=0.000, wall=19211, train_wall=9933, best_r=45.5885, argmax_r=37.3294, avg_r=36.0458
| epoch 001 | valid on 'valid' subset | loss 4.787 | nll_loss -0.005 | ppl 1.00 | num_updates 7020 | best_loss 4.78709 | best_r 40.6157 | argmax_r 33.3885 | avg_r 32.3892
| epoch 001 | valid on 'valid' subset | loss 4.772 | nll_loss -0.004 | ppl 1.00 | num_updates 7040 | best_loss 4.77184 | best_r 40.5752 | argmax_r 33.3985 | avg_r 32.3844
| epoch 001 | valid on 'valid' subset | loss 4.794 | nll_loss -0.004 | ppl 1.00 | num_updates 7060 | best_loss 4.79376 | best_r 40.5621 | argmax_r 33.4199 | avg_r 32.3519
| epoch 001 | valid on 'valid' subset | loss 4.804 | nll_loss -0.004 | ppl 1.00 | num_updates 7080 | best_loss 4.80376 | best_r 40.6089 | argmax_r 33.4286 | avg_r 32.3702
| epoch 001 | valid on 'valid' subset | loss 4.787 | nll_loss -0.004 | ppl 1.00 | num_updates 7100 | best_loss 4.78736 | best_r 40.5905 | argmax_r 33.3359 | avg_r 32.3731
| epoch 001 | valid on 'valid' subset | loss 4.783 | nll_loss -0.004 | ppl 1.00 | num_updates 7120 | best_loss 4.7831 | best_r 40.5774 | argmax_r 33.4007 | avg_r 32.3648
| epoch 001 | valid on 'valid' subset | loss 4.783 | nll_loss -0.004 | ppl 1.00 | num_updates 7140 | best_loss 4.78342 | best_r 40.5973 | argmax_r 33.4108 | avg_r 32.3734
| epoch 001 | valid on 'valid' subset | loss 4.797 | nll_loss -0.004 | ppl 1.00 | num_updates 7160 | best_loss 4.7974 | best_r 40.6008 | argmax_r 33.385 | avg_r 32.3664
| epoch 001 | valid on 'valid' subset | loss 4.770 | nll_loss -0.004 | ppl 1.00 | num_updates 7180 | best_loss 4.76965 | best_r 40.5604 | argmax_r 33.3454 | avg_r 32.3687
| epoch 001 | valid on 'valid' subset | loss 4.774 | nll_loss -0.004 | ppl 1.00 | num_updates 7200 | best_loss 4.77389 | best_r 40.5349 | argmax_r 33.3096 | avg_r 32.3436
| epoch 001 | valid on 'valid' subset | loss 4.766 | nll_loss -0.004 | ppl 1.00 | num_updates 7220 | best_loss 4.7661 | best_r 40.5394 | argmax_r 33.2899 | avg_r 32.3564
early stop
| epoch 001 | loss 5.414 | nll_loss -0.008 | ppl 0.99 | wps 1984 | ups 0 | wpb 5428.111 | bsz 186.423 | num_updates 7220 | lr 0.0001 | gnorm 7.682 | clip 0.000 | oom 0.000 | wall 19818 | train_wall 10239 | best_r 45.6022 | argmax_r 37.342 | avg_r 36.0586
| epoch 001 | valid on 'valid' subset | loss 4.766 | nll_loss -0.004 | ppl 1.00 | num_updates 7220 | best_loss 4.7661 | best_r 40.5394 | argmax_r 33.2899 | avg_r 32.3564
nsml save for epoch 1
| done training in 19826.6 seconds
User session exited