@wasertech
Created April 21, 2022 13:24
Tried with the old training interface: it still hangs...
trainer@e4916e93eaab:~/stt$ TF_CUDNN_RESET_RND_GEN_STATE=1 python train.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/test_models/alphabet.txt --scorer_path /mnt/lm/kenlm.scorer --feature_cache /tmp/feature_cache --train_files /mnt/extracted/data/cv-fr/clips/train.csv --dev_files /mnt/extracted/data/cv-fr/clips/dev.csv --train_batch_size 32 --dev_batch_size 32 --n_hidden 2048 --epochs 3 --learning_rate 0.0001 --dropout_rate 0.3 --lm_alpha 0.0 --lm_beta 0.0 --log_level=0 --early_stop true --checkpoint_dir /mnt/test2_checkpoints/
Using the top level train.py script is deprecated and will be removed in a future release. Instead use: python -m coqui_stt_training.train
I Performing dummy training to check for memory problems.
I If the following process crashes, you likely have batch sizes that are too big for your available system memory (or GPU memory).
D Session opened.
I Loading best validating checkpoint from /mnt/test2_checkpoints/best_dev-1
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 1073.532959
Epoch 0 | Validation | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 351.812256 | Dataset: /mnt/extracted/data/cv-fr/clips/dev.csv
--------------------------------------------------------------------------------
I FINISHED optimization in 0:00:05.373691
D Session closed.
I Dummy run finished without problems, now starting real training process.
D Session opened.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 1073.532959
Epoch 0 | Validation | Elapsed Time: 0:00:40 | Steps: 248 | Loss: 270.871342 | Dataset: /mnt/extracted/data/cv-fr/clips/dev.csv
I Saved new best validating model with loss 270.871342 to: /mnt/test2_checkpoints/best_dev-2
--------------------------------------------------------------------------------
Epoch 1 | Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 414.540070
Epoch 1 | Validation | Elapsed Time: 0:00:35 | Steps: 248 | Loss: 390.220405 | Dataset: /mnt/extracted/data/cv-fr/clips/dev.csv
--------------------------------------------------------------------------------
Epoch 2 | Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 477.380676
Epoch 2 | Validation | Elapsed Time: 0:00:35 | Steps: 248 | Loss: 440.691332 | Dataset: /mnt/extracted/data/cv-fr/clips/dev.csv
--------------------------------------------------------------------------------
I FINISHED optimization in 0:01:56.464317
D Session closed.
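
To make the step counts in the log above easier to compare, here is a minimal sketch that parses the `Epoch ... | Training/Validation | ...` progress lines and lists training phases that report very few steps. The function names `parse_progress` and `short_training_epochs` and the `min_steps` threshold are hypothetical helpers for illustration, not part of Coqui STT. It assumes the progress lines look like the ones pasted here. A low step count may only reflect how the progress bar was captured, so treat the result as a hint rather than a diagnosis.

```python
import re

# Matches progress lines of the form seen in the pasted log, e.g.
# "Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 1073.532959"
# Whitespace around the "|" separators is matched loosely.
PROGRESS_RE = re.compile(
    r"^Epoch\s+(?P<epoch>\d+)\s*\|\s*(?P<phase>Training|Validation)\s*\|\s*"
    r"Elapsed Time:\s*(?P<elapsed>[\d:]+)\s*\|\s*Steps:\s*(?P<steps>\d+)\s*\|\s*"
    r"Loss:\s*(?P<loss>\d+(?:\.\d+)?)"
)


def parse_progress(log_text):
    """Return one dict per progress line found in log_text (hypothetical helper)."""
    rows = []
    for line in log_text.splitlines():
        match = PROGRESS_RE.match(line.strip())
        if match is None:
            continue  # not a progress line (e.g. "I Loading variable ...")
        rows.append(
            {
                "epoch": int(match.group("epoch")),
                "phase": match.group("phase"),
                "elapsed": match.group("elapsed"),
                "steps": int(match.group("steps")),
                "loss": float(match.group("loss")),
            }
        )
    return rows


def short_training_epochs(rows, min_steps=2):
    """Return epochs whose Training phase reported fewer than min_steps steps.

    min_steps is an arbitrary illustrative threshold, not a Coqui STT setting.
    """
    return [
        row["epoch"]
        for row in rows
        if row["phase"] == "Training" and row["steps"] < min_steps
    ]


if __name__ == "__main__":
    # Two lines copied from the real training run in the log above.
    sample = (
        "Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 1073.532959\n"
        "Epoch 0 | Validation | Elapsed Time: 0:00:40 | Steps: 248 | Loss: 270.871342\n"
    )
    rows = parse_progress(sample)
    for row in rows:
        print(row)
    print("Training epochs with few steps:", short_training_epochs(rows))
```

Running the same helpers over the full log saved to a file would show, per epoch, how many steps the training and validation phases each reported.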
@wasertech
Author

It still hangs!
