nov05
  • ⚠️🟢 Issue: training error
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6[1,mpirank:0,algo-1]<stderr>:,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 675, in <module>
[1,mpirank:1,algo-2]<stdout>:    main(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 572, in main
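The assertion `t >= 0 && t < n_classes` in the log above comes from the CUDA NLL-loss kernel and fires when a target label lies outside `[0, n_classes)`. A minimal sketch for checking labels before training, assuming integer class labels; the function name and the sample labels are hypothetical and not taken from train.py:

```python
from collections import Counter


def find_out_of_range_labels(labels, n_classes, ignore_index=-100):
    """Count labels outside [0, n_classes), skipping ignore_index."""
    bad = Counter()
    for t in labels:
        t = int(t)
        if t == ignore_index:
            continue  # the loss skips this value, so it is not an error
        if not 0 <= t < n_classes:
            bad[t] += 1
    return dict(bad)


# Hypothetical data: labels stored as 1..5 while the model has 5 outputs.
labels = [1, 2, 3, 4, 5, 5]
bad = find_out_of_range_labels(labels, n_classes=5)
print(bad)  # {5: 2}
```

If the check reports offending labels, 1-based labels or a model output size smaller than the number of classes are common causes. Rerunning on CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a clearer traceback.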
  • Uninstall all VS Code extensions
    Delete C:\Users\*\.vscode\extensions folder
    Reinstall extensions

  • Remove Jupyter kernels

(base) PS D:\github\udacity-nd009t-capstone-starter> jupyter kernelspec list
Available kernels:

To apply distributed training with the AWS SageMaker Linear Learner algorithm, you typically rely on SageMaker's built-in distributed training capabilities. Linear Learner supports distributed training by scaling across multiple CPU or GPU instances.

How to Apply Distributed Training for Linear Learner Algorithm in SageMaker

1. Using SageMaker Pre-built Containers with Distributed Training

The SageMaker Linear Learner algorithm offers a straightforward way to distribute training across multiple instances: set the instance_count parameter to a value greater than 1.
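As a rough sketch of the instance_count setting, assuming the SageMaker Python SDK v2 (`sagemaker.image_uris.retrieve` and `sagemaker.estimator.Estimator`): the helper names, the role ARN, the bucket, and the instance type below are placeholders for illustration, not values from the original note.

```python
def linear_learner_estimator_kwargs(role, output_path, instance_count=2,
                                    instance_type="ml.c5.xlarge"):
    """Build keyword arguments for sagemaker.estimator.Estimator."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "role": role,
        "instance_count": instance_count,  # > 1 spreads training across instances
        "instance_type": instance_type,    # placeholder instance type
        "output_path": output_path,
    }


def make_linear_learner(region, **kwargs):
    """Create the estimator; needs the sagemaker package and AWS credentials."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    image_uri = image_uris.retrieve("linear-learner", region)
    return Estimator(image_uri=image_uri, **kwargs)


kwargs = linear_learner_estimator_kwargs(
    role="arn:aws:iam::123456789012:role/ExampleRole",  # placeholder role ARN
    output_path="s3://example-bucket/linear-learner/",  # placeholder bucket
)
print(kwargs["instance_count"])  # 2
```

Calling `make_linear_learner("us-east-1", **kwargs)` would then build the estimator. Whether each instance receives the full dataset or only a shard depends on the distribution setting of the training input channel.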

Steps:

🟒 Different Levels of AWS Resources for Machine Learning Model Training and Deployment

  1. πŸ‘‰ EC2 Instances: Full User Control (Least Pre-built Content)
    With EC2, you have complete control over the entire setup. You need to:
    • Start an EC2 instance (e.g., GPU-enabled for training deep learning models).
    • Install dependencies manually (e.g., Python, ML libraries like PyTorch or TensorFlow).
    • Copy or configure the training script, and manage the training data (e.g., download it from S3 or other sources).
    • Run the training process manually using your own code.
    • Manage all aspects of the environment, scaling, and resource management.
nov05 / 20241122_AWS SageMaker JupyterLab (or any other IDE), set up GitHub username and password.md
Last active November 24, 2024 11:03
20241122_AWS SageMaker JupyterLab (or any other IDE), set up GitHub username and password
  • Don't use the email you registered with GitHub for commits. Instead, GitHub provides you with a proxy email for this purpose. Just go to 'Settings - Emails' in your GitHub account, and you'll find the proxy email there.
  • Don't use your GitHub login password for commits. Instead, go to 'Settings - Developer Settings - Personal access tokens', create a token, and use that as your password for commits. Since Fine-grained tokens are still in Preview, I'm using a classic token for now.
  • Local Install Requirements
Python 3.7
MXNet 1.8
Pandas >= 1.2.4
AutoGluon 0.2.0
  • 👉 Create SageMaker base environment
nov05 / 20240322_reinforcement learning_neural network soft update.md
Last active March 22, 2024 12:22
20240322_reinforcement learning_neural network soft update

"deeprl/agent/DDPG_agent.py"

  • trg = trg*(1-τ) + src*τ
  • τ is stored in self.config.target_network_mix
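    def soft_update(self, target, source):
        ## trg = trg*(1-τ) + src*τ
        ## τ is stored in self.config.target_network_mix
        tau = self.config.target_network_mix
        for target_param, source_param in zip(target.parameters(), source.parameters()):
            target_param.detach_()
            ## blend in place, following the formula above
            target_param.copy_(target_param * (1.0 - tau) + source_param * tau)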
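
To see what the rule trg = trg*(1-τ) + src*τ does numerically, here is a small sketch in which plain Python lists stand in for network parameters. The values and τ = 0.5 are chosen only for readability and are assumptions; in practice τ is usually much smaller.

```python
def soft_update(target, source, tau):
    """Return target*(1 - tau) + source*tau, element by element."""
    return [t * (1.0 - tau) + s * tau for t, s in zip(target, source)]


target = [0.0, 10.0]
source = [1.0, 0.0]
tau = 0.5  # illustrative only; not the value from self.config.target_network_mix

for step in range(3):
    target = soft_update(target, source, tau)
    print(step, target)
# 0 [0.5, 5.0]
# 1 [0.75, 2.5]
# 2 [0.875, 1.25]
```

Each update moves the target a fraction τ of the remaining distance toward the source, so a small τ makes the target network change slowly.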

👉 Udacity Deep Reinforcement Learning Python Environment Setup

⚠️ Python 3.11 has to be downgraded to Python 3.10; otherwise multiprocessing raises TypeError: code() argument 13 must be str, not int on both Windows and Linux. Google Colab currently uses Python 3.10 as well.


(drlnd_p2) PS D:\github\udacity-deep-reinforcement-learning\python\mujoco-py> python examples\body_interaction.py

You appear to be missing MuJoCo.  We expected to find the file here: C:\Users\*\.mujoco\mujoco210

This package only provides python bindings, the library must be installed separately.

Please follow the instructions on the README to install MuJoCo