nov05 / 20250205_udacity_nd189_capstone_training_issues.md

Last active February 8, 2025 06:40

⚠️🟢 Issue: training error

[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 675, in <module>
[1,mpirank:1,algo-2]<stdout>:    main(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 572, in main
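
The assertion `t >= 0 && t < n_classes` in the log above fires when a target label passed to the NLL loss falls outside the range `[0, n_classes)`. Below is a minimal, framework-free sketch for checking labels before training. It assumes integer class labels; the helper name `find_out_of_range_labels` and the sample labels are hypothetical, not taken from `train.py`.

```python
def find_out_of_range_labels(labels, n_classes):
    """Return (index, label) pairs for labels outside [0, n_classes)."""
    return [(i, t) for i, t in enumerate(labels) if not 0 <= t < n_classes]


# Hypothetical example: labels numbered 1..5 used with a 5-class model.
labels = [1, 2, 3, 4, 5]
bad = find_out_of_range_labels(labels, n_classes=5)
print("out-of-range labels:", bad)

# One possible remedy, ONLY if the data really is 1-indexed: shift to 0..4.
shifted = [t - 1 for t in labels]
print("after shifting:", find_out_of_range_labels(shifted, n_classes=5))
```

Whether shifting is the right fix depends on how the dataset encodes its classes; the check itself only reports which labels would trip the assertion.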

nov05 / 20250205_VSCdoe remove Jupyter kernels.md

Last active February 6, 2025 00:27

Uninstall all VS Code extensions
Delete C:\Users\*\.vscode\extensions folder
Reinstall extensions
Remove Jupyter kernels

(base) PS D:\github\udacity-nd009t-capstone-starter> jupyter kernelspec list
Available kernels:
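
A small sketch of how the listing above could be turned into removal commands. It assumes the usual `jupyter kernelspec list` layout (a header line followed by `name  path` rows) and builds `jupyter kernelspec uninstall <name>` commands without running them. The sample listing, kernel names, and paths are hypothetical.

```python
def parse_kernel_names(listing):
    """Extract kernel names from the text printed by `jupyter kernelspec list`."""
    names = []
    for line in listing.splitlines():
        parts = line.split()
        # Kernel rows look like "  <name>    <path>"; skip the header and blanks.
        if len(parts) >= 2 and not line.startswith("Available"):
            names.append(parts[0])
    return names


def uninstall_command(name):
    """Build (but do not run) the command that removes one kernel spec."""
    return ["jupyter", "kernelspec", "uninstall", name]


# Hypothetical listing; real names and paths will differ.
sample = (
    "Available kernels:\n"
    "  python3    C:\\Users\\me\\miniconda3\\share\\jupyter\\kernels\\python3\n"
    "  old_env    C:\\Users\\me\\AppData\\Roaming\\jupyter\\kernels\\old_env\n"
)

for kernel in parse_kernel_names(sample):
    print(" ".join(uninstall_command(kernel)))
```

Each command can then be run by hand or passed to `subprocess.run`; the uninstall command asks for confirmation before deleting anything.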

nov05 / 20250128_AWS S3 data to SageMaker machine learning training.md

Last active January 31, 2025 22:52

🟢 AWS S3 data to SageMaker machine learning training

WebDataset source code
https://github.com/webdataset/webdataset

Code snippets are from the following sources:

✅ Why I Chose WebDataset for Training on 50TB of Data?
Ahmad Sachal, May 22, 2023

nov05 / 20250126_AWS SageMaker linear learner distributed training.md

Created January 26, 2025 20:25

To run distributed training with the AWS SageMaker Linear Learner algorithm, rely on SageMaker's built-in distributed training capabilities. The Linear Learner algorithm supports distributed training by scaling across multiple instances and using multiple GPUs or CPU cores.

How to Apply Distributed Training for Linear Learner Algorithm in SageMaker

1. Using SageMaker Pre-built Containers with Distributed Training

The SageMaker Linear Learner algorithm distributes training across multiple instances when you set the instance_count parameter to a value greater than 1.
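
As a rough illustration, the instance count ends up in the `ResourceConfig` section of a SageMaker `CreateTrainingJob` request as `InstanceCount`. The sketch below only builds that dictionary and does not call AWS. The helper name `build_resource_config`, the instance type, and the volume size are illustrative assumptions, not values from the original note.

```python
def build_resource_config(instance_count, instance_type="ml.c5.2xlarge", volume_size_gb=30):
    """Return the ResourceConfig part of a CreateTrainingJob request.

    The instance type and volume size defaults are placeholder assumptions.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "InstanceCount": instance_count,  # more than 1 means multi-instance training
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_size_gb,
    }


# Two instances: training is spread across both.
config = build_resource_config(instance_count=2)
print(config)
```

With the SageMaker Python SDK, the same setting is normally passed as the `instance_count` argument of the estimator rather than written by hand.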

Steps:

nov05 / 20250126_AWS SageMaker model training and deployment resources.md

Last active January 27, 2025 17:48

🟢 Different Levels of AWS Resources for Machine Learning Model Training and Deployment

👉 EC2 Instances: Full User Control (Least Pre-built Content)
With EC2, you have complete control over the entire setup. You need to:
- Start an EC2 instance (e.g., GPU-enabled for training deep learning models).
- Install dependencies manually (e.g., Python, ML libraries like PyTorch or TensorFlow).
- Copy or configure the training script, and handle the training data management (downloading data from S3 or other sources).
- Run the training process manually using your own code.
- Manage all aspects of the environment, scaling, and resource management.

nov05 / 20241122_AWS SageMaker JupyterLab (or any other IDE), set up GitHub username and password.md

Last active November 24, 2024 11:03

20241122_AWS SageMaker JupyterLab (or any other IDE), set up GitHub username and password

Don't use the email you registered with GitHub for commits. Instead, GitHub provides you with a proxy email for this purpose. Just go to 'Settings - Emails' in your GitHub account, and you'll find the proxy email there.
Don't use your GitHub login password for commits. Instead, go to 'Settings - Developer Settings - Personal access tokens', create a token, and use that as your password for commits. Since Fine-grained tokens are still in Preview, I'm using a classic token for now.

nov05 / 20241119_udacity-aws-mle-nanodegree-env.md

Last active December 6, 2024 01:30

Local Install Requirements

Python 3.7
MXNet 1.8
Pandas >= 1.2.4
AutoGluon 0.2.0

👉 create sagemaker base environment

nov05 / 20240322_reinforcement learning_neural network soft update.md

Last active March 22, 2024 12:22

20240322_reinforcement learning_neural network soft update

"deeprl/agent/DDPG_agent.py"

trg = trg*(1-τ) + src*τ
τ is stored in self.config.target_network_mix

    def soft_update(self, target, source):
        ## trg = trg*(1-τ) + src*τ
        ## τ is stored in self.config.target_network_mix
        for target_param, source_param in zip(target.parameters(), source.parameters()):
            target_param.detach_()
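
To make the update rule trg = trg*(1-τ) + src*τ concrete, here is a framework-free sketch that applies it to plain lists of floats instead of network parameters. It is not the code from DDPG_agent.py; the standalone `soft_update` function and the sample values are assumptions for illustration only.

```python
def soft_update(target, source, tau):
    """Blend source values into target in place: trg = trg*(1-tau) + src*tau."""
    if len(target) != len(source):
        raise ValueError("target and source must have the same length")
    for i, (trg, src) in enumerate(zip(target, source)):
        target[i] = trg * (1.0 - tau) + src * tau
    return target


# Hypothetical weights: a small tau moves the target only slightly toward the source.
target_weights = [0.0, 10.0]
source_weights = [1.0, 0.0]
soft_update(target_weights, source_weights, tau=0.1)
print(target_weights)
```

With tau equal to 1 the target becomes a copy of the source; with tau equal to 0 it never changes. Small values in between give the slowly tracking target network that the formula describes.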

nov05 / 20240225_udacity deep reinforcement learning_py310 env setup.md

Last active November 2, 2024 03:54

👉 Udacity Deep Reinforcement Learning Python Environment Setup

⚠️ Python 3.11 has to be downgraded to Python 3.10; otherwise multiprocessing raises TypeError: code() argument 13 must be str, not int on both Windows and Linux. Google Colab is currently using Python 3.10 as well.
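
A tiny guard along these lines could catch the wrong interpreter before any worker processes start. It is a sketch based only on the note above (anything from 3.11 upward is treated as unsupported); the function name `python_version_ok` is hypothetical.

```python
import sys


def python_version_ok(version_info=None):
    """Return True when the interpreter is older than Python 3.11."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) < (3, 11)


if not python_version_ok():
    print("Warning: this environment expects Python 3.10, found",
          ".".join(str(part) for part in sys.version_info[:3]))
```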

Windows 11 (64-bit), VSCode, PowerShell, Miniconda3, Python 3.10
repo: https://github.com/Nov05/udacity-deep-reinforcement-learning
working dir: D:\github\udacity-deep-reinforcement-learning\python
package deeprl is copied and modified from https://github.com/ShangtongZhang/DeepRL/tree/master/deep_rl into .\python.

nov05 / 20240224_You appear to be missing MuJoCo.md

Created February 24, 2024 17:07

(drlnd_p2) PS D:\github\udacity-deep-reinforcement-learning\python\mujoco-py> python examples\body_interaction.py

You appear to be missing MuJoCo.  We expected to find the file here: C:\Users\*\.mujoco\mujoco210

This package only provides python bindings, the library must be installed separately.

Please follow the instructions on the README to install MuJoCo
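
Before rerunning the example, the location named in the message can be checked with a few lines of Python. This sketch assumes only what the message states, namely that the library is expected in `.mujoco\mujoco210` under the user's home directory; the helper names are hypothetical.

```python
from pathlib import Path


def expected_mujoco_dir(home=None):
    """Return the directory the error message points at: <home>/.mujoco/mujoco210."""
    base = Path(home) if home is not None else Path.home()
    return base / ".mujoco" / "mujoco210"


def mujoco_present(home=None):
    """Return True if the expected MuJoCo directory exists."""
    return expected_mujoco_dir(home).is_dir()


print(expected_mujoco_dir(), "exists:", mujoco_present())
```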