Motivation
While investigating DPU offload, we use Slurm as the workload management system that integrates
with the offload daemons. This document gives a short description of how to install Slurm on two Ubuntu 20.04 hosts
and enable prolog/epilog scripts to launch and destroy the offload daemons.
Installation
$ sudo apt install -y slurmctld slurmd
Generate slurm.conf
Build a configuration file using your favorite web browser and the Slurm Configuration Tool.
NOTE: setting SlurmUser to root will make setup/configuration easier.
$ cat << EOF | sudo tee /etc/slurm-llnl/slurm.conf
<the output of slurm configurator>
EOF
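For orientation, the sketch below prints a minimal two-host slurm.conf. It is not a replacement for the configurator output. The cluster name, the control plane host name (hpc-01 in the usage note) and CPUs=1 are assumptions for illustration. The plugin choices mirror the ones that appear in the logs later in this document. The commented Prolog/Epilog lines use hypothetical paths.

```shell
# Print a minimal two-host slurm.conf to stdout.
# $1 = control plane host, $2 = compute host (placeholders, adjust to your hosts).
gen_slurm_conf() {
    ctl_host=$1
    compute_host=$2
    cat << EOF
# Minimal sketch; values below are assumptions, not configurator output.
ClusterName=cluster
SlurmctldHost=$ctl_host
SlurmUser=root
AuthType=auth/munge
ProctrackType=proctrack/cgroup
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
# Hypothetical script paths for launching/destroying offload daemons:
#Prolog=/path/to/prolog.sh
#Epilog=/path/to/epilog.sh
NodeName=$compute_host CPUs=1 State=UNKNOWN
PartitionName=dev Nodes=$compute_host Default=YES MaxTime=INFINITE State=UP
EOF
}
```

Usage: gen_slurm_conf hpc-01 hpc-02 | sudo tee /etc/slurm-llnl/slurm.conf. Running slurmd -C on the compute node prints the hardware values it detects, which can replace the assumed CPUs=1.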
Start control plane
$ sudo systemctl start slurmctld
Start compute node
$ sudo systemctl start slurmd
Verification
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
dev* up infinite 1 idle hpc-02
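The check above can also be scripted. The helper below is a small sketch that assumes the default sinfo column layout shown above (STATE in column 5, NODELIST in column 6); the function name is made up for this example.

```shell
# Read `sinfo` output on stdin and print the NODELIST of every row whose
# STATE is exactly "idle". Assumes the default sinfo columns.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { print $6 }'
}
```

Usage: sinfo | idle_nodes. Once a node is listed, a trivial job such as srun -N1 hostname can confirm that the compute node actually runs work.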
Troubleshooting
Failed to start control plane because of Invalid SelectTypeParameters
If the Slurm control plane fails to start with the following errors, SelectTypeParameters is not set to a valid value in slurm.conf.
[2022-12-28T14:24:37.900] select/cons_tres: select_p_node_init
[2022-12-28T14:24:37.900] fatal: Invalid SelectTypeParameters: NONE (0), You need at least CR_(CPU|CORE|SOCKET)*
Add the following configuration to /etc/slurm-llnl/slurm.conf on the control plane node.
SelectTypeParameters=CR_CPU
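Before restarting slurmctld, the configuration file can be checked for this setting. The helper below is a sketch. Its name is made up for this example, and the accepted values are taken from the CR_(CPU|CORE|SOCKET) hint in the error message above rather than from a full list of valid options.

```shell
# Succeed if the given slurm.conf sets SelectTypeParameters to a value that
# starts with CR_CPU, CR_CORE or CR_SOCKET (matched regardless of letter case).
# $1 = path to slurm.conf (defaults to the path used in this document).
has_valid_select_params() {
    conf=${1:-/etc/slurm-llnl/slurm.conf}
    grep -Eiq '^[[:space:]]*SelectTypeParameters=CR_(CPU|CORE|SOCKET)' "$conf"
}
```

Usage: has_valid_select_params && sudo systemctl restart slurmctld.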
Failed to start compute node
If the compute node fails to start with the following errors, it requires additional cgroup configuration.
[2022-12-29T08:03:59.906] error: cgroup namespace 'freezer' not mounted. aborting
[2022-12-29T08:03:59.906] error: unable to create freezer cgroup namespace
[2022-12-29T08:03:59.906] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-12-29T08:03:59.906] error: cannot create proctrack context for proctrack/cgroup
The following command generates the cgroup configuration for the compute node.
$ cat << EOF | sudo tee /etc/slurm-llnl/cgroup.conf
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no
EOF
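To see whether the freezer controller is mounted before and after restarting slurmd, the mount table can be inspected. The helper below is a sketch that assumes a cgroup v1 layout where each controller appears as a mount option in /proc/mounts; the function name is made up for this example.

```shell
# Succeed if a cgroup v1 controller is listed as mounted.
# $1 = controller name (e.g. freezer), $2 = mount table (default /proc/mounts).
cgroup_v1_mounted() {
    controller=$1
    mounts=${2:-/proc/mounts}
    awk -v c="$controller" '
        $3 == "cgroup" {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == c) found = 1
        }
        END { exit !found }
    ' "$mounts"
}
```

Usage: sudo systemctl restart slurmd, then cgroup_v1_mounted freezer && echo "freezer mounted".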
Failed to authenticate compute nodes
If the Slurm control plane reports the following errors when a compute node registers, the same Munge key must be deployed to every node.
[2022-12-29T03:58:02.770] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
[2022-12-29T03:58:02.770] error: slurm_unpack_received_msg: Protocol authentication error
[2022-12-29T03:58:02.780] error: slurm_receive_msg [10.209.226.195:36432]: Unspecified error
The following commands regenerate the Munge key and use scp to copy it to every node in the cluster.
$ sudo create-munge-key # generate munge key
$ scp /etc/munge/munge.key <all nodes of cluster> # copy munge.key into every node of the cluster
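The copy step can be wrapped in a loop over the nodes. The function below is a sketch under several assumptions that are not in the original text: passwordless ssh and sudo on every node, a munge service running as the munge user, and a made-up function name. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Copy the local Munge key to each node passed as an argument, fix its
# ownership and mode, and restart munge there.
# Assumes passwordless ssh and sudo on every node (an assumption).
sync_munge_key() {
    key=/etc/munge/munge.key
    for node in "$@"; do
        copy="sudo cat $key | ssh $node 'sudo tee $key > /dev/null'"
        fix="ssh $node 'sudo chown munge:munge $key && sudo chmod 400 $key && sudo systemctl restart munge'"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            # Only show what would be executed.
            printf '%s\n' "$copy" "$fix"
        else
            sh -c "$copy" && sh -c "$fix"
        fi
    done
}
```

Usage: DRY_RUN=1 sync_munge_key hpc-02 to preview, then sync_munge_key hpc-02. The local munge service also needs a restart after the key is regenerated (sudo systemctl restart munge). Running munge -n | ssh hpc-02 unmunge afterwards checks that a credential created locally decodes on the remote node.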