咖啡香

Slurm 2: prolog/epilog tutorial

Motivation

During the investigation of DPU offload, Slurm is used as the workload management system that integrates with the offload daemons. This document briefly describes how to configure Slurm prolog/epilog scripts so that they can later drive the offload daemons.

Prolog Script

Create the Prolog and PrologSlurmctld scripts for Slurm. For now, each script only records its arguments and the environment variables it receives, so that later operations know what they can rely on.

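$ cat << 'EOF' | sudo tee /opt/hpc-cloud/prolog-ctld.sh
#!/bin/bash

# Record the arguments and environment that slurmctld passes to the script.
echo "$@" > /opt/hpc-cloud/prolog-ctld.log

env >> /opt/hpc-cloud/prolog-ctld.log
EOF
$ cat << 'EOF' | sudo tee /opt/hpc-cloud/prolog.sh
#!/bin/bash

# Record the arguments and environment that slurmd passes to the script.
echo "$@" > /opt/hpc-cloud/prolog.log

env >> /opt/hpc-cloud/prolog.log
EOF
$ sudo chmod +x /opt/hpc-cloud/prolog-ctld.sh /opt/hpc-cloud/prolog.sh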
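
The scripts above overwrite a single log file on every job. As a minimal sketch of one alternative, the function below writes one log per job ID and keeps only the SLURM-prefixed variables. The names `LOG_DIR` and `log_job_env` are hypothetical and not part of Slurm; the fallback to a temporary directory exists only so the snippet can be tried outside a cluster.

```shell
#!/bin/bash
# Sketch of a prolog body that keeps one log file per job.
# LOG_DIR and log_job_env are hypothetical names, not part of Slurm.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"

log_job_env() {
    # SLURM_JOB_ID is set by Slurm for prolog scripts; "unknown" is a
    # fallback so the function can also be run outside a job.
    local log="${LOG_DIR}/prolog.${SLURM_JOB_ID:-unknown}.log"
    echo "args: $*" > "$log"
    # Keep only the variables Slurm passes in, sorted for easy diffing.
    env | grep '^SLURM' | sort >> "$log"
    echo "$log"
}

# Try it outside Slurm with a fake job ID.
export SLURM_JOB_ID=16
logfile=$(log_job_env)
cat "$logfile"
```

In a real prolog, `LOG_DIR` would have to point at a directory that the user running the script can write to.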

Update slurm.conf

After creating the scripts, update slurm.conf on both the control plane and the compute nodes to enable Prolog and PrologSlurmctld as follows:

Prolog=/opt/hpc-cloud/prolog.sh
PrologSlurmctld=/opt/hpc-cloud/prolog-ctld.sh
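
Before restarting the daemons, it can help to confirm that the configured paths point at executable files. The following is a rough sketch, assuming a hypothetical helper named `check_prolog_paths` (not a Slurm tool) and simple `Key=/path` lines with no spaces around `=`; it runs against a throwaway config so no real cluster is touched.

```shell
#!/bin/bash
# check_prolog_paths is a hypothetical helper, not a Slurm tool.
# It prints each Prolog/PrologSlurmctld entry found in a slurm.conf-style
# file and whether the referenced script is executable.
check_prolog_paths() {
    local conf="$1" key path
    grep -Ei '^(Prolog|PrologSlurmctld)=' "$conf" | while IFS='=' read -r key path; do
        if [ -x "$path" ]; then
            echo "$key ok $path"
        else
            echo "$key not-executable $path"
        fi
    done
}

# Demo: one script exists and is executable, the other is missing.
demo=$(mktemp -d)
printf '#!/bin/bash\n' > "$demo/prolog.sh"
chmod +x "$demo/prolog.sh"
printf 'Prolog=%s/prolog.sh\nPrologSlurmctld=%s/prolog-ctld.sh\n' "$demo" "$demo" > "$demo/slurm.conf"

result=$(check_prolog_paths "$demo/slurm.conf")
echo "$result"
```

On a real node the function would be pointed at the actual configuration file, for example the path reported in SLURM_CONF.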

Restart slurmctld on the control plane:

$ sudo systemctl restart slurmctld

Restart slurmd on each compute node:

$ sudo systemctl restart slurmd

Verification

Submit a batch job to Slurm and check the output of the Prolog and PrologSlurmctld scripts. The logs show that only Slurm's predefined environment variables are available in the scripts; other variables, such as DCM_CHARTS=ucc passed with --export, are not.

$ cat << EOF | tee sleep.sh
#!/bin/bash

#SBATCH -o job.%j.out
#SBATCH -p dev
#SBATCH --qos=low
#SBATCH -J hpc-test
#SBATCH -c 2
#SBATCH -n 5
#SBATCH --export=DCM_CHARTS=ucc

/usr/bin/sleep 10

env
EOF
$ sbatch sleep.sh
$ cat /opt/hpc-cloud/prolog.log

SLURM_JOB_USER=klausm
SLURM_JOB_UID=47906
SLURMD_NODENAME=hpc-cloud01
SLURM_CLUSTER_NAME=openbce
PWD=/var/log
SLURM_JOB_PARTITION=dev
SLURM_JOBID=16
SLURM_JOB_CONSTRAINTS=(null)
SLURM_SCRIPT_CONTEXT=prolog_slurmd
SLURM_NODELIST=hpc-cloud01
SLURM_STEP_ID=4294967294
SHLVL=1
SLURM_UID=47906
SLURM_JOB_ID=16
SLURM_CONF=/etc/slurm-llnl/slurm.conf
_=/usr/bin/env
$ cat /opt/hpc-cloud/prolog-ctld.log
SLURM_JOB_USER=klausm
SLURM_JOB_UID=47906
SLURM_CLUSTER_NAME=openbce
PWD=/var/log
SLURM_JOB_PARTITION=dev
SLURM_JOBID=16
SLURM_SCRIPT_CONTEXT=prolog_slurmctld
SLURM_JOB_ACCOUNT=(null)
SHLVL=1
SLURM_JOB_ID=16
SLURM_JOB_NAME=hpc-test
SLURM_JOB_GROUP=dip
SLURM_JOB_GID=30
SLURM_JOB_NODELIST=hpc-cloud01
_=/usr/bin/env
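
The two logs differ in which variables they contain. As a small sketch, the snippet below compares the variable names in abridged copies of both logs and checks that DCM_CHARTS appears in neither. The helper `var_names` and the temporary file names are hypothetical; on a real system the two log paths under /opt/hpc-cloud would be used instead of the embedded samples.

```shell
#!/bin/bash
# var_names is a hypothetical helper: it prints the sorted variable names
# found in an env-style log (NAME=value per line).
var_names() {
    grep '=' "$1" | cut -d= -f1 | sort -u
}

work=$(mktemp -d)
# Abridged copies of the two logs from the verification run.
cat > "$work/prolog.log" << 'EOF'
SLURM_JOB_USER=klausm
SLURMD_NODENAME=hpc-cloud01
SLURM_JOB_PARTITION=dev
SLURM_SCRIPT_CONTEXT=prolog_slurmd
SLURM_JOB_ID=16
EOF
cat > "$work/prolog-ctld.log" << 'EOF'
SLURM_JOB_USER=klausm
SLURM_JOB_PARTITION=dev
SLURM_SCRIPT_CONTEXT=prolog_slurmctld
SLURM_JOB_ID=16
SLURM_JOB_NAME=hpc-test
EOF

var_names "$work/prolog.log" > "$work/slurmd.names"
var_names "$work/prolog-ctld.log" > "$work/ctld.names"

# comm -23: names only in the first file; comm -13: names only in the second.
only_slurmd=$(comm -23 "$work/slurmd.names" "$work/ctld.names")
only_ctld=$(comm -13 "$work/slurmd.names" "$work/ctld.names")
echo "only in prolog.log: $only_slurmd"
echo "only in prolog-ctld.log: $only_ctld"

# The variable requested with --export should be absent from both logs.
if grep -q '^DCM_CHARTS=' "$work/prolog.log" "$work/prolog-ctld.log"; then
    dcm_seen=yes
else
    dcm_seen=no
fi
echo "DCM_CHARTS seen: $dcm_seen"
```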
