Written by Administrator   
Friday, 27 May 2022 11:00

Some hints:

Hydra provides various software and compilers. Notable ones are:

GNU compiler version 9.4
Intel oneAPI 2021
OpenMPI 4.1.1
HDF5, Valgrind, Boost and much more...

 

All this software is accessible through a module system. Those not familiar with it can check out the official documentation:
https://lmod.readthedocs.io/en/latest/010_user.html

 

In short, modules allow users to dynamically modify their environment by loading and unloading the pieces of software they need. The following commands are useful:
"module avail" shows the list of available modules
"module list" shows the list of user's loaded modules
"module load/unload modulename" loads/unloads module modulename
module purge” unloads all modules
module –show_hidden spider” list all modules, including hidden ones.
"module help" shows help

 

The default environment (if no further modules are loaded/unloaded) is the GNU compiler + autotools + the OpenMPI libraries.

 

I have also installed some extra software (in extsoft), including Miniconda, Mathematica and a newer version of the GNU compiler and OpenMPI toolchain. If you are going to use it, purge the loaded modules first, just in case. For Miniconda, at the moment only I can install software, so if you want an environment available, tell me the steps to create it and I will do it for you, with the hopefully useful side effect that everybody on the cluster can use the same environment for their simulations.
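As a rough sketch, the "steps" for an environment request can simply be the conda commands you would run yourself; the environment name and package list below are made up for illustration:

    conda create -n myenv python=3.10 numpy scipy   # hypothetical name and packages
    conda activate myenv
    python -c "import numpy"                        # quick check that the environment works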

 

There is not a lot of space in /home. Use the /lustre partition instead (if you plan to use your own conda, configure your package and environment directories there). You should have a directory for your user. Lustre is not a long-term repository and it is not backed up often. It should stay below 80% full. After finishing your current research, back up and remove your data.
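For instance, assuming your Lustre directory is /lustre/$USER (adjust the path to whatever your actual user directory is), you could point conda there with:

    conda config --add pkgs_dirs /lustre/$USER/conda/pkgs   # where downloaded packages are cached
    conda config --add envs_dirs /lustre/$USER/conda/envs   # where environments are created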

 

The queue system is SLURM. Those interested can check out the official docs:
https://slurm.schedmd.com/quickstart.html

 

The most important thing is that there are two alternative ways to run jobs:
Interactive: may be useful when debugging and testing
Batch: the regular way to submit jobs to a queue

Have a look at the detailed documentation for each one, with example programs and scripts: Interactive, Batch. A minimal sketch of each is given below.
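A minimal sketch of the two ways, assuming default options; the exact options (partition, time limits, etc.) may differ on Hydra, so check the Interactive and Batch pages above:

    # Interactive: request one task and open a shell on a compute node
    srun -N 1 -n 1 --pty bash -i

    # Batch: write a job script (see the running modes below) and submit it
    sbatch job.script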

 

Check info about the cluster and the queue:
To gather info about the cluster or its nodes, use the commands sinfo -l or sinfo -Nl. If you want to see the queue, use the command squeue.
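For quick reference (the -u filter is a standard SLURM option, added here as an extra):

    sinfo -l          # summary of the partitions and their state
    sinfo -Nl         # the same information, node by node
    squeue            # all jobs currently in the queue
    squeue -u $USER   # only your own jobs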

 

Hydra's running modes:

  1. Serial
    Modify the batch script setting N=1 and n=1 (it's the default, so you can leave these out if you don't define them in the script).
    As memory is also a resource, it is convenient to use the option --mem=xG, where x is the number of GB. By default a job gets 2 GB per CPU.
    If you want to launch the same serial program independently multiple times, you can use the array option
    --array=0-299 (this option is like launching the script 300 times manually); a serial script is sketched after this list. More info about arrays here:
    https://slurm.schedmd.com/job_array.html
  2. MPD
    If you want to use all the cores in a node, using nodes only one by one (no MPI parallelization):
    Modify the batch script setting N=1 and n=1. Now submit the script to the queue with the important option:
    sbatch --exclusive job.script
    The --exclusive option will reserve the whole node for your run.
  3. MPI
    If you just want to use MPI but not multicore parallelization:
    Modify the batch script to set N="number of nodes you want to use" and n="number of MPI tasks".
    In the case of MPI, if you have some special memory constraints, it is better to use the memory option --mem-per-cpu=xG, where x is the number of GB of RAM allocated to each MPI task.
  4. MPI+MPD
    If you want to run a code using both MPI and multicore:
    Modify the batch script to set N="number of nodes you want to use" and n="number of MPI tasks".
    Now submit the script to the queue with the important option:
    sbatch --exclusive job.script
    to reserve all the cores in a node (an MPI script is sketched after this list).
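A minimal serial batch script might look like the sketch below; the job name and executable are placeholders and the memory value is only an example:

    #!/bin/bash
    #SBATCH --job-name=serial_test     # placeholder name
    #SBATCH -N 1                       # one node (the default)
    #SBATCH -n 1                       # one task (the default)
    #SBATCH --mem=4G                   # example value for --mem=xG
    ##SBATCH --array=0-299             # uncomment to run 300 independent copies

    ./my_program                       # placeholder executable; in an array job, $SLURM_ARRAY_TASK_ID holds the index

Submit it with "sbatch job.script" (or "sbatch --exclusive job.script" for the MPD mode).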
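An MPI (or MPI+MPD) script follows the same pattern; again the node and task counts, the job name and the executable are placeholders:

    #!/bin/bash
    #SBATCH --job-name=mpi_test        # placeholder name
    #SBATCH -N 2                       # number of nodes you want to use (example value)
    #SBATCH -n 8                       # number of MPI tasks (example value)
    #SBATCH --mem-per-cpu=2G           # optional: RAM per MPI task
    ##SBATCH --ntasks-per-core=2       # uncomment to use hyperthreading (see Multithreading below)

    srun ./my_mpi_program              # placeholder MPI executable

For plain MPI submit it with "sbatch job.script"; for MPI+MPD submit it with "sbatch --exclusive job.script" to reserve all the cores in the nodes.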

 

Multithreading
By default you can only use one thread per core. If you want to use hyperthreading with MPI, you have to add the option --ntasks-per-core=2.

Canceling jobs
To cancel a job, use scancel job_id.

Check job info
If you want to analyse your job (to see the allocated and consumed resources, or the elapsed time), you can use the sacct command:
sacct -l --units=G -j 123
That command will display a long list with all the information about job 123, with the units in GB. For more information about the data you can retrieve with this command, read the documentation:
https://slurm.schedmd.com/sacct.html
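For example, to show only a few of the standard sacct fields for your job (123 again stands for your job ID):

    sacct --units=G -j 123 --format=JobID,JobName,Elapsed,MaxRSS,State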

Last Updated on Friday, 27 May 2022 12:30