Create a Cluster with Spack Binary Cache

The binary cache used below is a subset of the Exascale Computing Project's Extreme-Scale Scientific Software Stack (E4S) (https://oaciss.uoregon.edu/ecp/).

| Package | Install command |
| --- | --- |
| openfoam | spack install --no-check-signature --cache-only openfoam |
| gromacs | spack install --no-check-signature --cache-only gromacs |
| gromacs (without SLURM/PMI support) | spack install --no-check-signature --cache-only gromacs ^openmpi~pmi schedulers=none |
| ior | spack install --no-check-signature --cache-only ior |
| osu-micro-benchmarks | spack install --no-check-signature --cache-only osu-micro-benchmarks |
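Every command above uses --cache-only, which tells Spack to install only from the binary cache and to fail rather than compile if no binary is available. Dropping the flag lets Spack fall back to building from source when a binary is missing, for example:

spack install --no-check-signature gromacs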
  1. Create a cluster. I used the following config:

Important: you must have s3_read_resource = arn:aws:s3:::* in your config so the cluster can read the S3 binary cache.

[aws]
aws_region_name = ${AWS_DEFAULT_REGION}

[global]
cluster_template = default
update_check = false
sanity_check = true

[cluster default]
key_name = ${AWS_DEFAULT_REGION}
vpc_settings = public
base_os = alinux2
ebs_settings = myebs
compute_instance_type = c5.18xlarge
master_instance_type = c5n.2xlarge
cluster_type = ondemand
placement_group = DYNAMIC
placement = compute
max_queue_size = 8
initial_queue_size = 0
disable_hyperthreading = true
scheduler = slurm
s3_read_resource = arn:aws:s3:::*

[vpc public]
vpc_id = ${vpc_id}
master_subnet_id = ${master_subnet_id}
compute_subnet_id = ${compute_subnet_id}

[ebs myebs]
shared_dir = /shared
volume_type = gp2
volume_size = 20

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
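With the config above saved locally (I assume a file named config.ini and the cluster name spack-demo below; both are placeholders), the cluster can be created and logged into with ParallelCluster 2.x commands along these lines:

pcluster create -c config.ini spack-demo
pcluster ssh spack-demo -i ~/.ssh/your-key.pem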
  2. Install Spack
sudo su 
export SPACK_ROOT=/shared/spack
mkdir -p $SPACK_ROOT
git clone https://github.com/spack/spack $SPACK_ROOT
cd $SPACK_ROOT
git checkout releases/v0.15
echo "export SPACK_ROOT=$SPACK_ROOT" > /etc/profile.d/spack.sh
echo "source $SPACK_ROOT/share/spack/setup-env.sh" >> /etc/profile.d/spack.sh
exit
source /etc/profile.d/spack.sh
sudo chown -R $USER:$USER $SPACK_ROOT

Verify the install:

$ spack -V
0.15.4
  3. Add the environment
mv $SPACK_ROOT/etc/spack/packages.yaml $HOME/bak_packages.yaml 
mkdir -p $SPACK_ROOT/var/spack/environments/aws
wget https://gist.githubusercontent.com/bollig/71383f92143ed6b006e5c3892343fef8/raw/2_spack.yaml -O $SPACK_ROOT/var/spack/environments/aws/spack.yaml
  4. Activate the environment
$ spack env list
aws
$ spack env activate aws
$ spack concretize
  5. Install Python 3 and Boto3
sudo yum install -y python3
sudo pip3 install boto3
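Spack needs Boto3 to fetch from the S3 mirror, so it is worth a quick sanity check (not part of the original steps) that the system Python can import it:

python3 -c "import boto3; print(boto3.__version__)"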
  6. Install packages!

NOTE: packages installed inside the aws Spack environment land in the global Spack installation, so you can load them as modules later without activating the environment. Activating the environment simply ensures that your Spack configuration exactly matches the CI/CD pipeline that built the binary cache.

spack env activate aws
spack install --no-check-signature ior
spack env deactivate

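If a freshly installed package does not show up under module avail, the Tcl module files can usually be regenerated by hand; this is a hedged suggestion, and the exact module names depend on the projections in spack.yaml:

spack module tcl refresh -y
module avail ior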

  7. Test your packages:
module load ior
srun -N 2 --ntasks-per-node=1 ior -w -r -o=/scratch/test_dir -b=256m -a=POSIX -i=5 -F -z -t=64m -C
  8. Confirm EFA support (optional):

    a. update your ParallelCluster config to use an EFA-enabled compute_instance_type (e.g., c5n.18xlarge) and add enable_efa = compute.


    b. update your cluster with:

    pcluster stop -c config.ini cluster_name
    pcluster update -c config.ini cluster_name

    c. ssh back to the cluster, and run:

    salloc -N 2 --tasks-per-node=1
    srun -N 2 --ntasks-per-node=1 --pty bash

    d. inside the interactive prompt:

    module load osu-micro-benchmarks
    fi_info -l
    # EFA enabled:
    srun -N 2 --ntasks-per-node=1 osu_bw
    # EFA disabled:
    FI_PROVIDER=^efa srun -N 2 --ntasks-per-node=1 osu_bw
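    To check for the EFA provider directly instead of scanning the full fi_info -l listing, libfabric can be queried for it by name; if EFA is not active this prints an error rather than provider details:

    fi_info -p efa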
    

FAQs

  1. My packages are installing from source - help!

Note: patchelf always installs from source - this is because it's a spack dependency and not a package dependency.

There are two reasons why this may happen:

a. No access to the S3 mirror. Run:

$ aws s3 ls s3://spack-mirrors/amzn2-e4s
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

If you see "Access Denied" then add the AmazonS3ReadOnlyAccess to your instance's iam profile, or add s3_read_resource = arn:aws:s3:::* to your cluster's config and run update.

b. Environment isn't activated

Check spack env list and make sure aws is highlighted in green; if it isn't, run spack env activate aws.
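You can also confirm, from within the activated environment, that the mirror is configured and that the package you want actually exists in the build cache (ior below is just an example name):

spack mirror list
spack buildcache list | grep -i ior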

  2. OpenMPI, Libfabric, SLURM, or other packages always use the local modules or paths. This is likely because you have a packages.yaml configured globally. Use spack config blame packages to identify which config files are affecting the configuration. For example, if you see
[...]
/shared/spack-0.15/var/spack/environments/aws/spack.yaml:139    slurm:
/shared/spack-0.15/var/spack/environments/aws/spack.yaml:154      paths:
/shared/spack-0.15/etc/spack/packages.yaml:12                       [email protected] +pmix: /opt/slurm/
/shared/spack-0.15/var/spack/environments/aws/spack.yaml:152      buildable: True
[...]

you can either extend your $SPACK_ROOT/var/spack/environments/aws/spack.yaml to disable the paths/modules (e.g., slurm: { paths: {[email protected] +pmix: null} }), or simply remove the offending packages.yaml (preferred).

NoTears HPC Users: rm $SPACK_ROOT/etc/spack/packages.yaml
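Before deleting anything, it can help to list every packages.yaml Spack might be reading (system, site, and user scopes) and see which lines win:

find /etc/spack $SPACK_ROOT/etc ~/.spack -name packages.yaml 2>/dev/null
spack config blame packages | less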

  3. If you see this error after installing a package:
==> Error: Failed to install XXXXXXXXXX due to ModuleNotFoundError: No module named 'botocore'
==> Error: No module named 'botocore'

it is caused by the Spack-installed Python package overriding the system-wide Python installed via yum. Do the following:

spack install --no-cache py-pip
pip3 install boto3

Then rerun your desired package install commands.

  4. If you see this error:
srun: error: _parse_next_key: Parsing error at unrecognized key: NodeSet
srun: error: Parse error in file /opt/slurm/etc/pcluster/slurm_parallelcluster_compute_partition.conf line 5: "NodeSet=compute_nodes Nodes=compute-dy-c5n18xlarge-[1-10]"
srun: error: "Include" failed in file /opt/slurm/etc/slurm_parallelcluster.conf line 8
srun: error: "Include" failed in file /opt/slurm/etc/slurm.conf line 70
srun: fatal: Unable to process configuration file

then run module unload slurm before running any SLURM commands (srun, squeue, etc.). The module version of SLURM does not match the ParallelCluster-provided version, and it does not understand some of the syntax in the generated config file.
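A quick way to confirm you are back on the ParallelCluster-provided SLURM after unloading the module (the expected path assumes the default /opt/slurm install used elsewhere in this guide):

module unload slurm
which srun    # expect /opt/slurm/bin/srun
sinfo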

2_spack.yaml (the environment file fetched in step 3):

spack:
  config:
    install_missing_compilers: true
  mirrors: { "mirror": "s3://spack-mirrors/amzn2-e4s" }
  packages:
    all:
      providers:
        blas:
        - openblas
        mpi:
        - openmpi
        - mpich
      variants: +mpi
    binutils:
      variants: +gold+headers+libiberty~nls
      version:
      - 2.33.1
    openfoam:
      version:
      - 2012
      - 2006
    paraview:
      variants: +qt+python3
    qt:
      variants: +opengl
    ncurses:
      variants: +termlib
    sqlite:
      variants: +column_metadata
    hdf5:
      variants: +hl
    mesa:
      # Will not work for graviton2; need a newer version of mesa for ARM
      variants: swr=avx,avx2
      version:
      - 18.3.6
    llvm:
      version:
      - 11.0.0
    hwloc:
      version: [2.4.0]
    munge:
      # Refer to ParallelCluster global munge space
      variants: localstatedir=/var
    openmpi:
      variants: fabrics=ofi +pmi +legacylaunchers schedulers=slurm
      version: [4.1.0]
    intel-mpi:
      version: [2020.2.254]
    slurm:
      variants: +pmix sysconfdir=/opt/slurm/etc
      version: [20-02-4-1]
    libfabric:
      variants: fabrics=efa,tcp,udp,sockets,verbs,shm,mrail,rxd,rxm
      version: [1.11.1]
    mpich:
      # For EFA (requires ch4)
      variants: ~wrapperrpath pmi=pmi netmod=ofi device=ch4
    libevent:
      version: [2.1.8]
    openblas:
      version: [0.3.10]
  modules:
    enable:
    - tcl
    prefix_inspections:
      bin:
      - PATH
      man:
      - MANPATH
      share/man:
      - MANPATH
      share/aclocal:
      - ACLOCAL_PATH
      lib:
      - LIBRARY_PATH
      lib64:
      - LIBRARY_PATH
      include:
      - CPATH
      lib/pkgconfig:
      - PKG_CONFIG_PATH
      lib64/pkgconfig:
      - PKG_CONFIG_PATH
      share/pkgconfig:
      - PKG_CONFIG_PATH
      '':
      - CMAKE_PREFIX_PATH
    tcl:
      verbose: True
      hash_length: 6
      projections:
        all: '{name}/{version}-{compiler.name}-{compiler.version}'
        ^libfabric: '{name}/{version}-{^mpi.name}-{^mpi.version}-{^libfabric.name}-{^libfabric.version}-{compiler.name}-{compiler.version}'
        ^mpi: '{name}/{version}-{^mpi.name}-{^mpi.version}-{compiler.name}-{compiler.version}'
      whitelist:
      - gcc
      blacklist:
      - slurm
      all:
        conflict:
        - '{name}'
        suffixes:
          '^openblas': openblas
          '^netlib-lapack': netlib
        filter:
          environment_blacklist: ['CPATH', 'LIBRARY_PATH']
        environment:
          set:
            '{name}_ROOT': '{prefix}'
        autoload: direct
      gcc:
        environment:
          set:
            CC: gcc
            CXX: g++
            FC: gfortran
            F90: gfortran
            F77: gfortran
      openmpi:
        environment:
          set:
            SLURM_MPI_TYPE: "pmix"
            OMPI_MCA_btl_tcp_if_exclude: "lo,docker0,virbr0"
      miniconda3:
        environment:
          set:
            CONDA_PKGS_DIRS: ~/.conda/pkgs
            CONDA_ENVS_PATH: ~/.conda/envs
    lmod:
      hierarchy:
      - mpi