Section outline

    • The term supercomputer refers to the most powerful computer systems available at a given time. The most powerful systems in the world are published on the TOP500 list (top500.org). When working with supercomputers, we often speak of high-performance computing (HPC), as these systems are much more powerful than personal computers. Today's supercomputers consist of a multitude of computers, or nodes, interconnected by networks specially adapted for fast data transfer. Many nodes are additionally equipped with accelerators for certain types of calculations. A computer system composed in this way is also referred to as a computer cluster.

      Supercomputers are designed for engineers, scientists, data analysts, and others who work with vast amounts of data or complex calculations that require billions of computational operations. They are used for computationally intensive calculations in various scientific fields such as biosciences (genomics, bioinformatics), chemistry (development of new compounds, molecular simulations), physics (particle physics, astrophysics), mechanical engineering (construction, fluid dynamics), meteorology (weather forecasting, climate studies), mathematics (cryptology) and computer science (knowledge discovery in data, artificial intelligence, image recognition, natural language processing).


      Gaining access

      Information about the procedure for obtaining access is usually published on the website of the supercomputer center. Computer systems under the auspices of SLING use the SLING SSO single sign-on system. More details can be found on the pages of SLING and the HPC-RIVR project.

      Login nodes

      During the course we will run jobs on the supercomputer through the Slurm middleware. The openly accessible computer clusters that support the Slurm system are NSC, Maister and Trdina. To work with Slurm, we connect to the computer cluster through a login node (an example of connecting follows the list below). The login nodes on these clusters are:
      • nsc-login.ijs.si (NSC),
      • rmaister.hpc-rivr.um.si (Maister) and
      • tvrdina-login.fis.unm.si (Trdina).
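
      For illustration, a minimal sketch of connecting to the NSC login node with an SSH client; the username alice is a placeholder that you replace with your own:

        # connect to the NSC login node over the SSH protocol
        ssh alice@nsc-login.ijs.si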

      A computer cluster consists of a multitude of nodes tightly interconnected by a network. The nodes work in the same way and consist of the same elements as personal computers: processors, memory, and input-output units. Cluster nodes, of course, surpass personal computers in the quantity, performance, and quality of the built-in components, but they usually do not have input-output devices such as a keyboard, mouse, and screen.

      Node types

      We distinguish several types of nodes in a cluster; their structure depends on their role in the system. The following are important for the user:

      • head nodes,
      • login nodes,
      • compute nodes and
      • data nodes.

      The head node ensures the coordinated operation of the entire cluster. It runs programs that monitor the status of the other nodes, dispatch jobs to the compute nodes, control the execution of jobs, and more.

      Users log in to the login node with software tools over the SSH protocol. Through the login node we transfer data and programs to and from the cluster, prepare, submit and monitor jobs for the compute nodes, reserve computing time on the compute nodes, log in to the compute nodes, and so on.
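
      A sketch of transferring files through the login node with scp; the file names and the username alice are placeholders:

        # copy a local file to our home directory on the cluster
        scp input.mp4 alice@nsc-login.ijs.si:~/
        # copy a result file from the cluster back to the current local folder
        scp alice@nsc-login.ijs.si:~/output.mp4 .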

      The jobs that we prepare on the login node are executed on the compute nodes. We distinguish several types of compute nodes: in addition to the usual processor nodes, there are nodes with more memory and nodes with accelerators, such as general-purpose graphics processing units. Compute nodes can be grouped into groups or partitions.
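
      On a cluster managed by Slurm, the partitions and the state of the compute nodes can be listed with the sinfo command; the partition name gpu below is only an example:

        # list partitions, node counts and node states
        sinfo
        # show the nodes of a chosen partition in more detail
        sinfo --partition=gpu --Node --long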

      We store data and programs on the data nodes. The data nodes are joined into a distributed file system (for example, Ceph). The distributed file system is visible to all login and compute nodes. Files transferred to the cluster via the login node are stored in the distributed file system.
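
      To see which file systems a node actually mounts, standard Linux tools suffice; note that the mount point of the distributed file system differs between clusters:

        # list mounted file systems with their types and free space
        df -hT
        # show how much space our home folder occupies
        du -sh ~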

      All nodes are interconnected by high-speed network connections, usually an Ethernet network and sometimes also an InfiniBand (IB) network for efficient communication between compute nodes. Network connections should have high bandwidth (the ability to transfer large amounts of data) and low latency (a short delay when transferring data).

      Node structure

      The vast majority of computers today follow the von Neumann architecture. In the processor, the control unit takes care of the coordinated operation of the system: it reads instructions and operands from memory and writes the results back to memory. The processor executes instructions in the arithmetic logic unit (ALU), using registers (for example, for tracking program flow and storing intermediate results). Memory stores data, that is, instructions and operands, and input-output units transfer data between the processor and memory on one side and the outside world on the other.
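
      On a Linux node, the processor and memory configuration described above can be inspected with standard tools:

        # show the number of cores, sockets and cache sizes of the processors
        lscpu
        # show the amount of installed and free memory
        free -h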

      Graphics processing units

      Some nodes are additionally equipped with compute accelerators. Today, graphics processing units are mostly used as accelerators. The basic task of a graphics card is to relieve the processor when rendering graphics on the screen. When rendering, it has to perform a multitude of independent calculations for millions of screen points. When it is not rendering to a screen, it can be used purely for computation.

      We are then talking about general-purpose graphics processing units (GPGPU). They excel whenever we have to perform a multitude of similar calculations on a large amount of data with few dependencies, for example in deep learning in artificial intelligence and in video processing. The architecture of graphics processing units is quite different from the architecture of conventional processors, so to run programs efficiently on graphics processing units, existing programs usually have to be heavily reworked.
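
      With Slurm, a graphics processing unit is typically requested as a generic resource; a minimal sketch, assuming the cluster has GPU nodes configured with the --gres option:

        # request one GPU and check that it is visible inside the job
        srun --gres=gpu:1 nvidia-smi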

      Computer cluster software consists of:

      • operating system,
      • middleware and
      • user software (applications).

      Operating system

      An operating system is the bridge between user software and computer hardware. It is software that performs basic tasks such as memory management, processor management, device control, file system management, security functions, system operation control, resource consumption monitoring and error detection. Popular operating systems are the free Linux and the commercial macOS and Windows. The CentOS Linux operating system is installed on the nodes of the NSC, Maister and Trdina clusters.
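
      After logging in, the installed operating system and kernel version can be checked directly from the command line:

        # print the Linux distribution installed on the node
        cat /etc/os-release
        # print the kernel version
        uname -r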

      Middleware

      Middleware is software that connects the operating system and user applications. In a computer cluster it takes care of the coordinated operation of a multitude of nodes, enables centralized management of the nodes, handles user authentication, controls the execution of jobs (user applications) on the nodes, and the like. Users of computer clusters mostly deal with the middleware for job management, where Slurm (Simple Linux Utility for Resource Management) is very widespread. The Slurm system manages the job queue, allocates the required resources to jobs and monitors their execution. With Slurm, users request access to resources (compute nodes) for a certain period of time, launch jobs and monitor their execution.
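
      The most common Slurm commands for working with jobs; the script name job.sh and the job ID are placeholders:

        # submit a job script to the queue
        sbatch job.sh
        # run a single command as a job and wait for its output
        srun hostname
        # list our jobs in the queue
        squeue -u $USER
        # cancel a job with the given job ID
        scancel 12345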

      User software

      User software is the reason we use computers in the first place, both personal computers and computer clusters. With user software, users perform the tasks they need. Only user software adapted to the Linux operating system can be used on the clusters. Some examples of user software used on clusters: Gromacs for molecular dynamics simulations, OpenFOAM for fluid flow simulations, Athena for analyzing collisions from the LHC collider at CERN (the ATLAS experiment), and TensorFlow for training deep models in artificial intelligence. In the workshop, we will use the FFmpeg video processing tool.
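
      As a sketch, a simple FFmpeg conversion run on a compute node through Slurm; the file names are placeholders, and on some clusters FFmpeg must first be loaded as a module or run from a container:

        # convert a video to another format on a compute node
        srun ffmpeg -i input.mp4 output.webm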

      User software can be installed on a cluster in a variety of ways:

      • the administrator installs it directly on the nodes,
      • the administrator prepares environment modules,
      • the administrator prepares containers for general use,
      • the user installs it in their own folder, or
      • the user prepares a container in their own folder.

      To make system maintenance easier, administrators install user software in the form of environment modules or, preferably, containers.

      Environment modules

      When we log in to the cluster, we find ourselves in a command line with the standard environment settings. This environment can be extended for easier work, most simply with environment modules. Environment modules are a tool for changing command-line settings and allow users to easily change the environment while working.

      Each environment module file contains the information needed to set up the command-line environment for the selected software. When we load an environment module, the environment variables are adjusted so that the selected user software can run. One such variable is PATH, which lists the folders in which the operating system searches for programs.

      Environment modules are installed and updated by the cluster administrator. By preparing environment modules for the software, the administrator simplifies maintenance and avoids installation problems caused by conflicting library versions. Users can load and remove the prepared modules during their work.
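
      A sketch of typical work with environment modules; the module name and version are examples and depend on what the administrator has prepared:

        # list the modules available on the cluster
        module avail
        # load the module for the selected software
        module load FFmpeg/4.3.1
        # show the currently loaded modules
        module list
        # remove all loaded modules
        module purge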

      Virtualization and containers


      As we have seen, the nodes have a multitude of processor cores that can run many user applications simultaneously. When user applications are installed directly on the operating system, conflicts can quickly arise, mostly because of incompatible library versions.

      Node virtualization is an elegant solution that ensures the coexistence of a wide variety of user applications and thus easier system management. Virtualization slightly reduces the capacity of the system, but in return we get greater robustness and easier system maintenance. We distinguish between hardware virtualization and operating-system-level virtualization: in the first case we speak of virtual machines, in the second of containers.

      Container virtualization is more suitable for supercomputing clusters. Containers do not include their own operating system, so they are smaller and it is easier for the container runtime to switch between them. The container runtime keeps the containers isolated from each other and gives each container access to the shared operating system kernel and core libraries. Only the necessary user software and additional libraries are then installed separately in each container.

      Computer cluster administrators want users to use containers as much as possible, because:

      • we can prepare the containers ourselves (we have the right to install the software in the containers),
      • containers offer us many options for customizing the operating system, tools and user software,
      • containers ensure reproducibility of job execution (the same results even after the operating system is upgraded),
      • through containers, administrators can more easily control resource consumption.


      Docker containers are the most common. On the clusters mentioned above, the Singularity container runtime is installed, which is better adapted to the supercomputing environment. In a Singularity container we work with the same user account as on the host operating system and have direct access to the network and to our data. Singularity can also run Docker and other containers.
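
      A minimal sketch of working with the Singularity runtime; the image is an example pulled from the public Docker registry:

        # build a Singularity image from a public Docker image
        singularity pull docker://ubuntu:22.04
        # run a command inside the resulting container
        singularity exec ubuntu_22.04.sif cat /etc/os-release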