A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff78790823e72098e6057528dc01000000 Worker ID: 8683ec6d11263c22dd0d9b2e020e9dc0a21892e1351e41c864d93940 Node ID: 0c1a86314b69f902942ec5c0678a43307476ec8862da5d23e9e634eb Worker IP address: 172.20.90.62 Worker port: 40229 Worker PID: 1032972 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayTrainWorker pid=1032972) Loading cached queryfile '/tmp/.cache/reve-dataloader/training-ready-laion5B_datacompxl_1024px_aesthetics_with_geminiv2.db'. [repeated 7x across cluster]
(TorchTrainer pid=1027792) Worker 5 has failed.
2024-08-26 11:36:58,674 ERROR |
#!/bin/bash
create_and_copy() {
  file_path="/cns/li-d/home/zhanghan_brain/brain/gan/model_eval_all/"
  eval_folder="$file_path$1/eval"
  command1="fileutil mkdir -p $eval_folder --gfs_user zhanghan_brain"
  echo $command1
  $command1
  echo
import os
import time
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from depot import inits
from depot.utils import find_trainable_variables, find_variables, iter_data, shuffle |