Section outline

  • Dear hackathon attendee,
    on Day 2 of the hackathon we will open a survey to get your feedback about the event. The survey is a standard ELIXIR short-term feedback survey and its results will be uploaded to the ELIXIR Training Metrics Database. The survey is anonymous.

    • Connecting to a cluster

      Software tools

      Interactive work on a remote computer
We connect to the login nodes with a client that supports the Secure Shell (SSH) protocol. SSH enables a secure remote connection from one computer to another: it offers several options for user authentication and protects data integrity and confidentiality with strong encryption. The SSH protocol is used for interactive work on the login node as well as for data transfer to and from the cluster. On Linux, macOS and Windows 10, we can establish a connection from the command line (terminal, bash, PowerShell, cmd), in which we run the ssh program. In Windows 10 (April 2018 update and newer), the SSH client is available by default, but in older builds it must be enabled first (instructions, howtogeek). For older versions of Windows, we need to install an SSH client separately; one of the most established is PuTTY.

      Data transfer
Secure data transfer from our computer to a remote computer and back also takes place via the SSH protocol; this is called SFTP (Secure File Transfer Protocol) or FTP over SSH.

Data can be transferred with the scp program (Secure CoPy), which we run from the command line. This program is installed in operating systems together with the ssh program. For easier work, we can use programs with a graphical interface. FileZilla is available for all the mentioned operating systems, and CyberDuck is also very popular on macOS and Windows.
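A basic scp invocation might look like this (a sketch assuming the NSC login node used later in this guide; <name> and the file names are placeholders):

```shell
# Copy a local file into the home directory on the cluster.
scp results.txt <name>@nsc-login.ijs.si:~/

# Copy a file from the cluster back to the current local folder.
scp <name>@nsc-login.ijs.si:~/results.txt .
```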

      All-in-one tools
There are a number of combined tools for working on remote systems that support both interactive work and data transfer. On Windows, the MobaXterm tool is well known; on Linux we can use Snowflake, and on macOS (unfortunately paid only) Termius.

      For software developers, we recommend using the Visual Studio Code development environment with the Remote-SSH extension, which is available for all of these operating systems.

      Text editors
We need a text editor to prepare job scripts for the cluster. Data transfer programs and all-in-one tools also let us edit files directly on the cluster. On Linux and macOS, we can use a default program, such as Text Editor, to edit simple text files. Things get a little complicated on Windows, which uses a slightly different text format: while Linux and macOS end each line with the LF (Line Feed) character, Windows ends it with the CR (Carriage Return) and LF characters. On Windows we therefore prefer not to edit cluster files with Notepad, but to install Notepad++. Before saving a file to the cluster in Notepad++, change the line-ending format from Windows (CR LF) to Unix (LF) in the lower right corner.
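If a file with Windows line endings has already ended up on the cluster, the trailing CR characters can be stripped from the command line. A minimal sketch (the sed approach works with GNU sed as found on Linux clusters; on macOS, sed -i requires a suffix argument, and the dos2unix utility does the same job where installed):

```shell
# Create a file with Windows (CR LF) line endings for demonstration.
printf 'line one\r\nline two\r\n' > job_windows.sh

# Strip the trailing CR from every line, leaving Unix (LF) endings.
sed -i 's/\r$//' job_windows.sh

# Verify: od -c now shows \n without a preceding \r.
od -c job_windows.sh
```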


    • Log in to a cluster

Run the command line: the simplest way is to press a special key on the keyboard (the Super key on Linux, Command+Space on macOS, or the Windows key on Windows), type “terminal” and click on the proposed program. In the command line of the terminal (the window that opens), write:

and start the program by pressing the Enter key. Enter the name of your SLING SSO user account instead of <name>. If we are working with another login node, we replace the part after the @ sign accordingly.
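The connection command might look like this (a sketch using the NSC login node referred to throughout this guide; <name> is a placeholder for your username):

```shell
# Connect to the NSC login node; replace <name> with your username.
ssh <name>@nsc-login.ijs.si
```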

At your first sign-in you will receive a note like this:
enter yes to add the login node with the specified fingerprint to the list of known hosts on your computer.

After entering the password for the user account <name>, we find ourselves on the login node, where a command line like this is waiting for us:

Enter hostname into the command line; this runs a program on the login node that tells us the name of the remote computer. In our case this is the same as the name of the login node, nsc-login.ijs.si.

We have run our first program on a computer cluster. Of course, not quite in the right way yet: the program ran on the login node, not on a compute node.

      To log out of the login node, enter the exit command:

      Login without password

      Transfer files to and from a cluster

      FileZilla
Start the FileZilla program and enter the data in the input fields below the menu bar: Host: sftp://nsc-login.ijs.si, Username: <name>, Password: <password>, then press Quickconnect. At the first login, we confirm that we trust the server. After a successful login, the left pane of the program shows the tree structure of the file system on the personal computer, and the right pane the tree structure of the file system on the computer cluster.

      CyberDuck
In the CyberDuck toolbar, press the Open Connection button. In the pop-up window, select the SFTP protocol in the upper drop-down menu and enter the following data: Host: nsc-login.ijs.si, Username: <name>, Password: <password>. Then press the Connect button. At the first login, we confirm that we trust the server. The tree structure of the file system on the computer cluster is then displayed.

Clicking on folders (directories) lets us move easily through the file system. In both programs, the folder we are currently in is shown above the tree structure, for example /ceph/grid/home/<name>. Right-clicking opens a menu with commands for working with folders (add, rename, delete) and files (add, rename, edit, delete). In FileZilla, files are transferred between the left and right panes; in CyberDuck, between the program window and ordinary folders. Files are transferred simply by dragging and dropping them with the mouse.

      Working with files directly on the cluster

You can also enter commands for working with files directly into the command line. Some important ones are:
• cd (change directory): move through the file system
1. cd <folder>: move to the given folder,
2. cd ..: move up to the parent folder,
3. cd: move to the user account's home folder,
• ls (list): print the contents of the folder,
• pwd (print working directory): the name of the folder we are in,
• cp (copy): copy files,
• mv (move): move and rename files,
• cat <file>: display the contents of the file,
• nano <file>: edit the file,
• man <command>: help on using the command.
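A short session using these commands might look like this (the folder and file names are illustrative):

```shell
# Create a folder and move into it.
mkdir demo
cd demo
pwd                      # prints the current folder, ending in /demo

# Create a small file and display its contents.
echo "hello cluster" > notes.txt
cat notes.txt

# Copy, rename, and list the folder contents.
cp notes.txt copy.txt
mv copy.txt renamed.txt
ls

# Move back up to the parent folder.
cd ..
```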
    • Jobs and tasks in the Slurm system

Users of computer clusters mostly work through the workload manager Slurm (Simple Linux Utility for Resource Management). The Slurm system manages the job queue, allocates the required resources to jobs, and monitors their execution. Through Slurm, users are granted access to resources (compute nodes) for a certain period of time, start jobs on them and monitor their execution.

      Jobs

User programs on the compute nodes are started via the Slurm system. For this purpose, we prepare a job in which we state:
• which programs and files we need for the run,
• how the program is invoked,
• which computing resources we need for the execution,
• the time limit for the execution of the job, and the like.
A job that runs on multiple cores at the same time is usually divided into tasks.

Job life cycle

Once the job is ready, we submit it to the queue. The Slurm system assigns it a job identifier (JOBID) and puts it in the pending state. Slurm selects queued jobs for execution based on the available computing resources, the estimated execution time, and the set priority.

When the required resources become available, the job starts running. After execution finishes, the job passes through the completing state, while Slurm waits for the remaining nodes to wrap up, into the completed state.

If necessary, the job can be suspended or cancelled. A job may also end in the failed state due to execution errors, or the Slurm system may terminate it when its time limit expires.
    • Display cluster information

      Slurm provides a series of commands for working with a cluster. In this section we will look at some examples of using the sinfo, squeue, scontrol, and sacct commands, which serve to display useful information about cluster configuration and status. Detailed information on all commands supported by Slurm can be found on the Slurm project home page.

      Sinfo command

The command displays information about the state of the cluster, its partitions (parts of the cluster) and nodes, and the available computing capacity. A number of switches allow us to specify more precisely which information about the cluster to print (documentation).
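In its basic form sinfo summarizes the partitions; with the --Node and --long switches it prints per-node details (a sketch; the exact columns depend on the Slurm version and configuration):

```shell
# Summary of partitions: state, time limit of jobs, node lists.
sinfo

# One line per node: partition, state, CPUS, S:C:T, memory, features.
sinfo --Node --long
```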

      Above we can see which logical partitions are available, their status, the time limit of jobs on each partition, and the lists of computing nodes that belong to them. The printout can be customized with the appropriate switches, depending on what we are interested in.
The above printout tells us the following for each compute node in the cluster: the partition it belongs to (PARTITION), its state (STATE), the number of cores (CPUS), the number of processor sockets (S), the cores per socket (C), the hardware threads per core (T), the amount of system memory (MEMORY), and any features (AVAIL_FEATURES) attributed to the node (e.g., processor type, presence of graphics processing units, etc.). Parts of the cluster can be reserved in advance for various reasons (maintenance work, workshops, projects). Example of listing active reservations on the NSC cluster:
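A sketch of such a query (the --reservation switch restricts the printout to reservation information):

```shell
# List active reservations: name, duration, nodes, allowed users.
sinfo --reservation
```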
The above printout shows any active reservations on the cluster, the duration of each reservation, and the list of nodes that are part of it. An individual reservation is assigned a group of users who can use it and thus avoid waiting behind the jobs of users who do not have a reservation.

      Squeue command

In addition to the cluster configuration, we are of course also interested in the state of the job scheduling queue. With the squeue command, we can query jobs that are currently queued, running, or already completed successfully or unsuccessfully (documentation).

To print the current state of the queue, type:
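In its simplest form, without switches:

```shell
# Print all jobs currently queued or running.
squeue
```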
From the printout we can read the identifier of each job, the partition on which it runs, the name of the job, the user who started it, and the job's current state.

The printout also includes the total run time of each job and the list of nodes on which it is being carried out, or the reason why it has not yet started. Usually, we are most interested in the state of jobs that we started ourselves. The printout can be limited to the jobs of a selected user with the --user switch. Example of listing jobs owned by user gen012:
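A sketch of that query:

```shell
# Show only the jobs of user gen012.
squeue --user=gen012
```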
In addition, we can limit the printout to jobs in a certain state using the --states switch. Example of listing all jobs currently pending execution (PD):
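A sketch of that query:

```shell
# Show only jobs that are pending execution.
squeue --states=PD
```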

      Scontrol command

Sometimes we want even more detailed information about a partition, node, or job. We obtain it with the scontrol command (documentation). Below are some examples of how to use this command.

      Example of printing more detailed information about an individual partition:
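A sketch, using the gridlong partition mentioned later in this guide:

```shell
# Detailed information about the gridlong partition.
scontrol show partition gridlong
```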

      Example of more detailed information about the nsc-fp005 computing node:
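A sketch of that query:

```shell
# Detailed information about the nsc-fp005 compute node.
scontrol show node nsc-fp005
```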

      Sacct command

With the sacct command, we can find out more about completed jobs and jobs in progress. For example, for a selected user, we can check the status of all jobs over a period of time.
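Such a query might look like this (a sketch; <name> and the dates are placeholders to replace with your own):

```shell
# Jobs of user <name> started in the chosen period.
sacct --user=<name> --starttime=2024-05-01 --endtime=2024-05-31
```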
    • Starting jobs on a cluster

In this chapter we will look at the srun, sbatch and salloc commands for starting jobs, and the scancel command for cancelling a job.

      Srun command

The simplest way to start a job is with the srun command. The command is followed by various switches with which we specify the quantity and type of computing resources our job needs, and various other settings. A detailed explanation of all available options can be found in the documentation. We will look at some of the most basic and most commonly used.

      To begin with we will run a simple system program hostname as our job, which displays the name of the node on which it runs. Example of starting the hostname program on one of the compute nodes:
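A sketch of that invocation:

```shell
# Run a single instance of hostname on one of the compute nodes.
srun --ntasks=1 hostname
```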

We used the --ntasks=1 switch on the command line. With it we say that our job consists of a single task; we want a single instance of the hostname program to run. Slurm automatically assigns us one of the processor cores in the cluster and runs the task on it.

      In the next step, we can try to run several tasks within our job:
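For example, four tasks at once:

```shell
# Run four instances of hostname within one job.
srun --ntasks=4 hostname
```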
We immediately notice the difference in the printout. Now four identical tasks have been performed within our job, on four different processor cores located on the same compute node (nsc-msv002.ijs.si).

      Of course, our tasks can also be divided between several nodes.
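For example (a sketch; Slurm decides the actual placement of the tasks):

```shell
# Spread four tasks across two compute nodes.
srun --ntasks=4 --nodes=2 hostname
```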

      Our job can always be terminated by pressing Ctrl + C during execution.

      Sbatch command

The downside of the srun command is that it blocks our command line until the job is completed. In addition, it is awkward for running more complex jobs with a multitude of settings. In such cases, we prefer to use the sbatch command and write the job settings and the individual tasks of our job into a bash script file.
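Such a script might look like the following sketch, assembled from the settings explained below (the task and node counts are placeholder values, and launching the tasks through srun inside the script is an assumption):

```shell
#!/bin/bash
#SBATCH --job-name=my_job_name
#SBATCH --partition=gridlong
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=100MB
#SBATCH --output=my_job.out
#SBATCH --time=00:01:00

# The job itself: each task prints the name of its compute node.
srun hostname
```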

We have an example of such a script in the box above. At the top of the script is the line #!/bin/bash, which tells the system that this is a bash script. This is followed by the line-by-line settings of our job, each prefixed with #SBATCH. We have already seen how to specify the reservation, the number of tasks and the number of nodes for our job (--reservation, --ntasks and --nodes) with the srun command.

      Let's explain the other settings:
• --job-name=my_job_name: the name of the job, displayed when we make a query with the squeue command,
• --partition=gridlong: the partition on which we want to run our job (there is only one partition on the NSC cluster, so we can also omit this setting),
• --mem-per-cpu=100MB: the amount of system memory our job needs for each task (that is, per processor core),
• --output=my_job.out: the name of the file into which the content our job would otherwise print to standard output (the screen) is written,
• --time=00:01:00: the time limit of our job in hours:minutes:seconds format.
      This is followed by the launch of our job, which is the same as in the previous cases (hostname).

      Save the content in the box above to a file, such as job.sh, and run the job:
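A sketch of the submission:

```shell
# Submit the script to the queue; sbatch prints the assigned JOBID.
sbatch job.sh
```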
We can see that the command printed our job identifier and immediately returned control of the command line to us. When the job completes (we can check with the squeue command), the file my_job.out appears in the current folder, containing the result of the run.

      Scancel command

      Jobs started with the sbatch command can be terminated with the scancel command during execution. We only need to specify the appropriate job identifier (JOBID).
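For example (replace <JOBID> with the identifier reported by sbatch or squeue):

```shell
# Cancel the job with the given identifier.
scancel <JOBID>
```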

      Salloc command

The third way to start a job is with the salloc command. With it, we reserve in advance the computing capacity our tasks will need, and then run jobs directly from the command line with the srun command. The advantage of the salloc command is that we do not have to wait for free capacity every time we start a job with srun. The salloc command uses the same configuration switches as the srun and sbatch commands, so once we have reserved resources with salloc, we do not need to repeat all the requirements for every srun command. Example of running two instances of hostname on one node:

When used after salloc, srun works similarly to its use in sbatch scripts: it runs our tasks on the already acquired computing capacity. The acquired capacity is released by running exit at the command line after the work is done.

The salloc command offers another interesting way of using compute nodes. With it, we can obtain capacity on a compute node, connect to the node via the SSH protocol, and then execute commands directly on it. On the nsc-fp005 node, we run the hostname program, which displays the node name. After finishing the work on the compute node, we return to the login node with the exit command. There we run exit once more to release the capacity we occupied with the salloc command.
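The salloc workflow described above might be sketched as follows (the switch values are placeholders):

```shell
# Reserve capacity for two tasks on one node.
salloc --ntasks=2 --nodes=1

# Run two instances of hostname on the reserved capacity.
srun hostname

# Release the reserved capacity.
exit
```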
    • Modules and containers 

Ordinary users (i.e., non-administrators) cannot install programs system-wide. Installation must be arranged with the cluster administrator. You can always compile all the necessary software yourself and install it in your home directory, but this is a rather time-consuming and annoying task. In this section, we look at two approaches that make it easier to load the variety of software packages often used in supercomputing.

      Environment modules

The first approach is environment modules, which package selected user software. Modules are usually prepared and installed by an administrator, who also includes them in the module catalog. The user can then turn modules on or off with the module load and module unload commands. Different modules can also contain different versions of the same program, for example with and without support for graphics accelerators. A list of all modules is obtained with the module avail and module spider commands.

We will need the FFmpeg module in the workshop. It is already installed on the NSC cluster; we just need to load it:
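A sketch of loading the module (the exact module name and version on your cluster may differ; module avail lists what is installed):

```shell
# List available FFmpeg modules.
module avail FFmpeg

# Load the module.
module load FFmpeg
```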
      Use the module list command to see which modules have been loaded.

      Containers

      The disadvantage of modules is that they must be prepared and installed by an administrator. If this is not possible, we can choose another approach and package the program we need in a Singularity container. Such a container contains our program and all other programs and program libraries that it needs to function. You can create it on any computer and then copy it to a cluster.

When we have the appropriate container ready, we use it by writing singularity exec <container> before the desired command. The FFmpeg container (file ffmpeg_alpine.sif) is available here. Transfer it to the cluster and run:
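A sketch of that invocation:

```shell
# Run the ffmpeg program from inside the container.
singularity exec ffmpeg_alpine.sif ffmpeg -version
```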
A printout with information about the ffmpeg software version is displayed. The singularity program runs the ffmpeg_alpine.sif container, and the ffmpeg program is then executed inside the container.

You can also build the ffmpeg_alpine.sif container yourself. Searching the web with the keywords ffmpeg, container and docker probably brings us to the website https://hub.docker.com/r/jrottenberg/ffmpeg/ with a multitude of different ffmpeg containers. We choose the current version of the smallest one, built for Alpine Linux, which we build right on the login node.
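The build might look like this (a sketch; the tag 4.4-alpine is an assumed example, so check the Docker Hub page for the currently available tags):

```shell
# Build the container on the login node from a Docker Hub image.
singularity build ffmpeg_alpine.sif docker://jrottenberg/ffmpeg:4.4-alpine
```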

      More detailed instructions for preparing containers can be found at https://sylabs.io/guides/3.0/user-guide/.

Quite a few frequently used containers are available on the clusters to all users:

• on the NSC cluster, they are found in the /ceph/grid/singularity-images folder,
• on the Maister and Trdina clusters, in the /ceph/sys/singularity folder.