A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
  RayTask ID: ffffffffffffffff78790823e72098e6057528dc01000000
  Worker ID: 8683ec6d11263c22dd0d9b2e020e9dc0a21892e1351e41c864d93940
  Node ID: 0c1a86314b69f902942ec5c0678a43307476ec8862da5d23e9e634eb
  Worker IP address: 172.20.90.62
  Worker port: 40229
  Worker PID: 1032972
  Worker exit type: SYSTEM_ERROR
  Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
There are some potential root causes.
  (1) The process is killed by SIGKILL by OOM killer due to high memory usage.
  (2) ray stop --force is called.
  (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayTrainWorker pid=1032972) Loading cached queryfile '/tmp/.cache/reve-dataloader/training-ready-laion5B_datacompxl_1024px_aesthetics_with_geminiv2.db'. [repeated 7x across cluster]
(TorchTrainer pid=1027792) Worker 5 has failed.
2024-08-26 11:36:58,674 ERROR
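Given the exit detail above, the OOM killer (cause 1) is the first thing to rule out. A minimal check, assuming shell access to the node at 172.20.90.62 and a readable kernel log, is to grep dmesg for the dead worker's PID and ask Ray for its memory summary:

# Check whether the kernel OOM killer took down worker PID 1032972
dmesg -T | grep -iE "out of memory|oom-kill|killed process" | grep 1032972

# Summarize object store and worker memory usage as Ray sees it
ray memory

If dmesg shows a kill of that PID, the fix is to lower the per-worker memory footprint (smaller batches, fewer dataloader workers) or move to nodes with more RAM; otherwise look at causes (2) and (3) in the worker's own log files.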
#!/bin/bash

# Create the per-model eval folder under the shared model_eval_all directory.
create_and_copy() {
  file_path="/cns/li-d/home/zhanghan_brain/brain/gan/model_eval_all/"
  eval_folder="$file_path$1/eval"
  command1="fileutil mkdir -p $eval_folder --gfs_user zhanghan_brain"
  echo "$command1"
  $command1
  echo
}
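A hypothetical invocation, assuming the single argument is the model run name whose eval directory should be created (the name "my_gan_run" is made up for illustration):

# Creates /cns/li-d/home/zhanghan_brain/brain/gan/model_eval_all/my_gan_run/eval
create_and_copy "my_gan_run"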
hanzhanggit / gan_normalization.py
Created December 15, 2017 02:00
GAN with normalization
import os
import time
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from depot import inits
from depot.utils import find_trainable_variables, find_variables, iter_data, shuffle