咖啡香

Slurm 1: Setup Slurm with two Ubuntu20.04 hosts

Motivation

During the investigation of DPU offload, Slurm is used as workload management system to integrate with offload daemons. The document give a short description on how to install Slum with two Ubuntu20.04 hosts, and enable prolog/epilog to launch/destroy offload daemons.

Installation

$ sudo apt install -y slurmctld slurmd

Generate slurm.conf

Build a configuration file using your favorite web browser and the Slurm Configuration Tool.

NOTE: set SlurmUser to root will make setup/configuration easier.

cat << EOF | sudo tee /etc/slurm-llnl/slurm.conf
<the output of slurm configurator>
EOF

Start control plane

$ sudo systemctl start slurmctld

Start compute node

$ sudo systemctl start slurmd

Verification

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev*         up   infinite      1   idle hpc-02

Trouble Shooting

Failed to start control plane because of Invalid SelectTypeParameters

When slurm control plane start and got the following errors, it’s required to deploy Munge keys into every node.

[2022-12-28T14:24:37.900] select/cons_tres: select_p_node_init
[2022-12-28T14:24:37.900] fatal: Invalid SelectTypeParameters: NONE (0), You need at least CR_(CPU|CORE|SOCKET)*

Add the following configuration into /etc/slurm-llnl/slurm.conf in control plane node.

SelectTypeParameters=CR_CPU

Failed to start compute node

When start compute node and got the following errors, it requires additional configuration of cgroup for compute node.

[2022-12-29T08:03:59.906] error: cgroup namespace 'freezer' not mounted. aborting
[2022-12-29T08:03:59.906] error: unable to create freezer cgroup namespace
[2022-12-29T08:03:59.906] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-12-29T08:03:59.906] error: cannot create proctrack context for proctrack/cgroup

The following command will generate the cgroup configuration for compute node.

$ cat << EOF | sudo tee /etc/slurm-llnl/cgroup.conf
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no
EOF

Failed to authenticate compute nodes

When slurm control plane start and got the following errors, it’s required to deploy Munge keys into every node.

[2022-12-29T03:58:02.770] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
[2022-12-29T03:58:02.770] error: slurm_unpack_received_msg: Protocol authentication error
[2022-12-29T03:58:02.780] error: slurm_receive_msg [10.209.226.195:36432]: Unspecified error

The following command will re-generate Munge key, and leverage scp to copy it into every node in the cluster.

$ sudo create-munge-key # generate munge key
$ scp /etc/munge/munge.key <all nodes of cluster> # copy munge.key into every node of the cluster

References


comments powered by Disqus