TUoS HPC Changelog

New software installed: Nextflow 23.10.0

Type: New

Tags: #Nextflow #Stanage #software #update

Version 23.10.0 of the software Nextflow has been installed on Stanage.

The updated documentation can be found at:

https://docs.hpc.shef.ac.uk/en/latest/stanage/software/apps/nextflow.html

March 12, 2024

Temporary loss of Stanage and Bessemer HPC connectivity on January 17th 2024

Type: New

Tags: #Stanage #Bessemer #maintenance

Due to scheduled network upgrades on January 17th, there will be some short losses of network connectivity to our HPC systems, disrupting external access (between 09:30 and 12:00) and some supporting services (between 09:30 and 15:30).

Note: As the network disruption will be brief, users may notice little or no disruption when connecting via SSH or transferring files. However, you may experience:

Hung/terminated SSH sessions to login nodes.
Hung/terminated SCP/SFTP/rsync file transfers to/from login nodes (although rsync may be able to resume from where it left off if re-invoked in the right way).
Hung access to / file transfers to/from ‘/shared’ storage areas on Stanage login nodes and Bessemer login and worker nodes (depending on which /shared area you’re accessing). These connections should automatically resume without the need for any intervention.
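
For transfers interrupted by the disruption, one option is to wrap the transfer in a small retry loop. The following is a minimal sketch rather than a recommended procedure: the ‘retry’ function and the ‘RETRY_DELAY’ variable are hypothetical names introduced here, and rsync's ‘--partial’ option (which keeps partially transferred files so that a later run can reuse them) is assumed to be one suitable way of re-invoking rsync.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# Usage: retry <max attempts> <command> [args...]
# "retry" and RETRY_DELAY are hypothetical names used only in this sketch.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Stop as soon as the wrapped command exits successfully.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
        # Pause before the next attempt (seconds, default 30).
        if [ "$attempt" -le "$max" ]; then
            sleep "${RETRY_DELAY:-30}"
        fi
    done
    return 1
}

# Intended use (placeholders in angle brackets must be filled in):
#   retry 5 rsync -av --partial <source> <user>@<host>:<destination>
```

The function returns the success of the first attempt that works, or a non-zero status if every attempt fails, so it can be used in scripts that need to know whether the transfer eventually completed.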

If your jobs require access to Sheffield remote shared storage or require network access to start up correctly, you should hold your jobs during the maintenance period, then release them after the maintenance with the following commands for specific job IDs:

scontrol hold <job ID>
scontrol release <job ID>

If you want to hold and release all of your pending jobs, you can do so with the commands:

squeue --me --noheader -t PD --format="%A" | xargs -I {} scontrol hold {}
squeue --me --noheader -t PD --format="%A" | xargs -I {} scontrol release {}
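
To confirm which pending jobs are currently held before releasing them, the job IDs can be filtered by their pending reason. This is a sketch under assumptions: ‘held_job_ids’ is a hypothetical helper name, and it assumes that squeue's ‘%r’ format field reports the reason ‘JobHeldUser’ for jobs you have held yourself.

```shell
# Read "<job ID> <reason>" lines on standard input and print the IDs of
# jobs whose pending reason is JobHeldUser (assumed to mark user-held jobs).
# "held_job_ids" is a hypothetical helper name used only in this sketch.
held_job_ids() {
    awk '$2 == "JobHeldUser" { print $1 }'
}

# Intended use on a login node (requires Slurm's squeue):
#   squeue --me --noheader -t PD --format="%A %r" | held_job_ids
```

Jobs that are pending for other reasons (for example, waiting on priority or resources) are not printed, so the output lists only the jobs that still need releasing.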

January 10, 2024

Bessemer GPU driver upgrade

Type: New

Tags: #Bessemer #GPU #maintenance

Bessemer GPU nodes are being upgraded from driver version 525.105.17 (supporting CUDA 12.0) to driver version 535.129.03 (supporting CUDA 12.2).
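
If a job depends on the newer driver, it may be worth checking the driver version on the GPU node at the start of the job. The sketch below compares two dotted version strings; ‘version_at_least’ is a hypothetical helper name, and the nvidia-smi query shown in the comment is assumed to be available on the GPU nodes.

```shell
# Succeed if dotted version $1 is greater than or equal to dotted version $2.
# "version_at_least" is a hypothetical helper name used only in this sketch.
version_at_least() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing components count as 0; "+ 0" forces numeric comparison.
            xi = x[i] + 0
            yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Intended use on a GPU node (requires nvidia-smi):
#   current=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_at_least "$current" 535.129.03 || echo "driver older than 535.129.03" >&2
```

Components are compared numerically from left to right, so 535.129.03 is treated as newer than 525.147.05 even though its second component is smaller.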

January 08, 2024

Stanage maintenance outage and upgrades

Type: New

Tags: #Stanage #GPU #maintenance #outage

A period of offline maintenance for the Stanage cluster will take place from 09:00 on 15th December to 17:00 on 19th December.

What you need to know:

All Stanage login nodes and worker nodes will be unavailable throughout that period, so you won’t be able to run jobs or transfer data.

Leading up to the maintenance period, jobs will only start if the runtime they’ve requested is less than the time remaining until the maintenance period begins. For example, if the maintenance period is to start in 48 hours’ time, batch jobs that request 47 hours of runtime fit within that window and can potentially be executed, depending on how busy the job queues are, but jobs that request 49 hours can only be executed after the maintenance period. The same logic applies to interactive sessions: if you can’t start an interactive session, check that you’re requesting an appropriate amount of runtime (e.g. ‘srun --time=02:30:00 --pty bash’ to start an interactive session with 2.5 hours of runtime).
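
The rule above amounts to a simple comparison between the requested runtime and the time left before maintenance begins. As an illustrative sketch only (the function names are hypothetical, only time limits written as HH:MM:SS are handled, and the scheduler's own decision also depends on queue load), the check could be written as:

```shell
# Convert a time limit written as HH:MM:SS into seconds.
# awk is used so that values such as "08" are read as decimal numbers.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed if a job requesting $1 (HH:MM:SS) needs strictly less time than
# $2, the number of seconds remaining until the maintenance period starts.
fits_before_maintenance() {
    requested=$(hms_to_seconds "$1")
    [ "$requested" -lt "$2" ]
}

# Example from the text: maintenance starts in 48 hours.
remaining=$((48 * 3600))
fits_before_maintenance 47:00:00 "$remaining" && echo "47-hour job can start before maintenance"
fits_before_maintenance 49:00:00 "$remaining" || echo "49-hour job must wait until after maintenance"
```

With 48 hours remaining, the 47-hour request passes the check and the 49-hour request does not, matching the example in the paragraph above.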

You shouldn’t notice any changes in system behaviour following the maintenance period, although the GPU driver will be updated to version 525.147.05 (supporting CUDA 12.0).

The maintenance work should not impact data stored in home directories and /mnt/parscratch (Lustre filesystem), but please remember that neither of these areas is automatically backed up.

December 15, 2023