Section outline

    • OpenCL


      So far, we have learned how a GPU is built, what compute units and processing elements are, how programs run on it, how workgroups and threads are arranged, and what the GPU memory hierarchy looks like. Now it is time to look at the software model.

      OpenCL framework


      OpenCL (Open Computing Language) is an open standard framework designed for parallel programming of a wide range of heterogeneous systems. It includes the OpenCL C programming language, which is used to write the code that runs on devices (such as GPUs or FPGAs). As we have learned, such programs are called kernels, and all threads on the GPU execute the same kernel. Unfortunately, OpenCL has one major drawback: it is not easy to learn. Therefore, before we can write our first OpenCL program, we need to learn some important concepts.

      Platform and devices


      OpenCL sees a general heterogeneous system as a platform consisting of a host and one or more devices. The devices are usually GPUs or FPGA accelerators. The platform is shown in the image below.

      Each program in OpenCL consists of two main parts:

      • code running on the host, and
      • one or more kernels running on the device.

      OpenCL execution model


      We have already met the GPU execution model: we learned about compute units (CU), processing elements (PE), threads, and groups of threads. OpenCL uses a similar abstraction: a kernel is executed by a large number of threads simultaneously (OpenCL itself calls them work-items). The threads are grouped into workgroups. Threads in the same workgroup can share data through local memory and can be synchronized with each other.

      Each thread has its own unique identifier (ID). IDs can be one-, two-, or three-dimensional and are given as integers. To keep the explanation as clear as possible, we will limit ourselves to two dimensions below. Thread indexing is always adapted to the spatial organization of the data. For example, if our data are vectors, we will index in one dimension, but if our data is a matrix or an image, we will index in two dimensions. Thus, OpenCL allows us to organize threads spatially in the same way as the data. In the case of images, for example, the indices of the threads coincide with the indices (coordinates) of the image points. In OpenCL terminology, the thread space is called the NDRange. Workgroups are also indexed in a one-, two- or three-dimensional space. An example of two-dimensional indexing is shown in the figure below.

      Each thread in the 2D space of the NDRange has two pairs of indexes:
      • global index (global ID) and
      • local index (local ID).
      The global index gives the position of the thread within the whole NDRange space, while the local index gives its position within its workgroup. In the figure above, the NDRange thread space contains 16 x 16 = 256 threads in 4 workgroups, and each workgroup has 8 x 8 = 64 threads. The global index of each thread is determined by a pair (gy, gx), where gy denotes the index in the vertical direction (i.e., row) and gx the index in the horizontal direction (i.e., column). The coordinate (0, 0) is always in the upper left corner. Likewise, the local index of each thread is determined by a pair (ly, lx). As an example, let’s look at the thread marked in green: its global index is (12, 10), while its local index is (4, 2). The index of each workgroup is determined by a pair (wy, wx).

      Each NDRange thread space also has dimensions that tell how many threads or workgroups make it up. The global dimension of the NDRange space in the figure above is given by the pair (GY, GX) and is (16, 16), while the local dimension of an individual workgroup is given by the pair (LY, LX) and is (8, 8). The global dimension of the NDRange thread space can also be measured in workgroups; it is given by the pair (WY, WX) and is (2, 2) in the figure above.

      OpenCL memory model

      OpenCL contains an abstraction of the memory hierarchy on the GPU shown in the figure below.


      The memories provided by OpenCL are:


      • Private memory holds variables that are private to each thread. On a GPU, private memory is implemented by the registers of the processing elements. To make a variable private to a thread, we declare it with the __private qualifier. Variables declared inside a kernel are private by default, so the qualifier can be omitted.
      • Local memory holds variables that should be visible to all threads in the same workgroup. To store a variable in local memory, we add the __local qualifier to its declaration.
      • Global memory holds variables that should be visible to all threads from all workgroups in the NDRange thread space. To store a variable in global memory, we declare it with the __global qualifier.
      • Constant memory is the part of global memory that does not change during kernel execution. To store a constant in constant memory, we add the __constant qualifier to its declaration.

    • Creating an OpenCL environment on the host


      We have already learned that every program we want to run on a heterogeneous system with a GPU consists of two parts:

      • program code running on the host and
      • kernels running in parallel on the GPU device.
      We will now first look at the tasks of the program running on the host. The host program starts in the main() function and is responsible for the following steps:
      1. creating an OpenCL environment,
      2. compiling the OpenCL kernels,
      3. declaration and initialization of all variables,
      4. data transfer to the device,
      5. starting kernels and
      6. data transfer from the device.
      The program on the host therefore provides everything needed to run the kernel on the device. It is written in C and uses functions of the OpenCL API (Application Programming Interface). It runs sequentially on the host's central processing unit and does not contain any parallelism. A detailed description of the OpenCL API can be found on The OpenCL Specification website.

      OpenCL environment


      The first task of the host program is to create an OpenCL environment. The OpenCL environment is an abstraction of platforms, devices, contexts, and command queues. All these concepts will be described in detail below. Here we introduce them only briefly:
      • The platform is basically a software driver for the specific devices we will be using (for example, Nvidia).
      • The device is a concrete device inside the platform (for example, Nvidia Tesla K40).
      • The context contains the programs for each device, its memory objects (space in global memory in which we store data), and its command queues.
      • A command queue is a software interface for passing commands between a host and a device. Examples of commands are requests to transfer data to or from the device and requests to start kernels on the device.
      An OpenCL environment needs to be created for each heterogeneous platform on which we will later run kernels. It is created in the following steps:
      • selecting a platform that contains one or more devices,
      • selecting the devices within the platform on which we will run kernels,
      • creating a context that includes the devices, command queues, and memory objects,
      • creating command queues to pass commands from the host to the devices.
      Let’s take a closer look at each of these steps below.

      List of platforms


      Each kernel is executed on some platform (a heterogeneous computer system) containing one or more devices. The platform can be thought of as an OpenCL driver for a specific type of device. In order to adapt the kernels to the devices, we must first determine on which platform our program will run and whether the platform contains any devices at all. To do this, we call the function clGetPlatformIDs():

          //***************************************************
          // STEP 1: Discover and initialize the platforms
          //***************************************************
          cl_int status;   // error code returned by the OpenCL calls
                           // (assumed not to be declared elsewhere)
          cl_uint numPlatforms = 0;
          cl_platform_id *platforms = NULL;
      
          // First call: retrieve only the number of platforms present
          status = clGetPlatformIDs(
              0,
              NULL,
              &numPlatforms);
          clerr_chk(status);
      
          // Allocate enough space for each platform
          platforms = (cl_platform_id *)malloc(sizeof(cl_platform_id)*numPlatforms);
      
          // Second call: fill in the available platforms
          status = clGetPlatformIDs(
              numPlatforms,
              platforms,
              NULL);
          clerr_chk(status);
      

      The clGetPlatformIDs() function must be called twice: first to find out how many platforms the system has, and second to fill the list with all of the available platforms.

      List of devices


      Now we need to figure out how many and which devices we have in each platform. To do this, call the function clGetDeviceIDs():

          //***************************************************
          // STEP 2: Discover and initialize the devices
          //***************************************************
      
          cl_uint numDevices;
          cl_device_id *devices = NULL;
      
          // Use clGetDeviceIDs() to retrieve the number of devices present
          status = clGetDeviceIDs(
                                  platforms[0],
                                  CL_DEVICE_TYPE_GPU,
                                  0,
                                  NULL,
                                  &numDevices);
          clerr_chk(status);
      
      
          // Allocate enough space for each device
          devices = (cl_device_id*) malloc(numDevices*sizeof(cl_device_id));
      
          // Fill in devices with clGetDeviceIDs()
          status = clGetDeviceIDs(
                                  platforms[0],
                                  CL_DEVICE_TYPE_GPU,
                                  numDevices,
                                  devices,
                                  NULL);
          clerr_chk(status);
      

      The clGetDeviceIDs() function must again be called twice: first to find out how many devices the platform has, and second to copy all the devices into the previously allocated list of devices.

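      As a rule, OpenCL functions return an integer value (in our case called status) that holds either an error code or CL_SUCCESS for successful execution. Functions that create an object, such as clCreateContext(), return the object itself and report the status through their last argument. We must always check the status value and stop the program in case of an error.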

      Properties of platforms and devices


      Once we have a list of platforms and devices in them, we can determine their properties for each platform or device by calling the clGetPlatformInfo() and clGetDeviceInfo() functions:

          char buffer[1024];   // receives the returned strings
                               // (the size is an assumption)
      
          printf("=== OpenCL platforms: ===\n");
          for (cl_uint i = 0; i < numPlatforms; i++)
          {
              printf("  -- The platform with the index %u --\n", i);
              // Name of the platform
              clGetPlatformInfo(platforms[i],
                              CL_PLATFORM_NAME,
                              sizeof(buffer),
                              buffer,
                              NULL);
              printf("  PLATFORM_NAME = %s\n", buffer);
      
              // Vendor of the platform
              clGetPlatformInfo(platforms[i],
                              CL_PLATFORM_VENDOR,
                              sizeof(buffer),
                              buffer,
                              NULL);
              printf("  PLATFORM_VENDOR = %s\n", buffer);
      
          }
      

          // Each property is returned in a variable of its own type;
          // buffer is the char array already used for the platform strings.
          cl_uint  buf_uint;    // clock frequency, number of compute units
          size_t   buf_sizet;   // maximum workgroup size
          cl_ulong buf_ulong;   // local memory size in bytes
      
          printf("=== OpenCL devices: ===\n");
          for (cl_uint i = 0; i < numDevices; i++)
          {
              printf("  -- The device with the index %u --\n", i);
              clGetDeviceInfo(devices[i],
                              CL_DEVICE_NAME,
                              sizeof(buffer),
                              buffer,
                              NULL);
              printf("  CL_DEVICE_NAME = %s\n", buffer);
      
              clGetDeviceInfo(devices[i],
                              CL_DEVICE_VENDOR,
                              sizeof(buffer),
                              buffer,
                              NULL);
              printf("  CL_DEVICE_VENDOR = %s\n", buffer);
      
              clGetDeviceInfo(devices[i],
                              CL_DEVICE_MAX_CLOCK_FREQUENCY,
                              sizeof(buf_uint),
                              &buf_uint,
                              NULL);
              printf("  CL_DEVICE_MAX_CLOCK_FREQUENCY = %u\n",
                     (unsigned int)buf_uint);
      
              clGetDeviceInfo(devices[i],
                              CL_DEVICE_MAX_COMPUTE_UNITS,
                              sizeof(buf_uint),
                              &buf_uint,
                              NULL);
              printf("  CL_DEVICE_MAX_COMPUTE_UNITS = %u\n",
                     (unsigned int)buf_uint);
      
              clGetDeviceInfo(devices[i],
                              CL_DEVICE_MAX_WORK_GROUP_SIZE,
                              sizeof(buf_sizet),
                              &buf_sizet,
                              NULL);
              // %zu prints a size_t without truncating it
              printf("  CL_DEVICE_MAX_WORK_GROUP_SIZE = %zu\n",
                     buf_sizet);
      
              clGetDeviceInfo(devices[i],
                              CL_DEVICE_LOCAL_MEM_SIZE,
                              sizeof(buf_ulong),
                              &buf_ulong,
                              NULL);
              // cl_ulong is 64 bits wide, so print it as unsigned long long
              printf("  CL_DEVICE_LOCAL_MEM_SIZE = %llu\n",
                     (unsigned long long)buf_ulong);
          }
      

      The result of the above code for the NVIDIA CUDA platform and the Tesla K40 device is as follows:

      === OpenCL platforms: ===
        -- The platform with the index 0 --
        PLATFORM_NAME = NVIDIA CUDA
        PLATFORM_VENDOR = NVIDIA Corporation
      === OpenCL devices: ===
        -- The device with the index 0 --
        CL_DEVICE_NAME = Tesla K40m
        CL_DEVICE_VENDOR = NVIDIA Corporation
        CL_DEVICE_MAX_CLOCK_FREQUENCY = 745
        CL_DEVICE_MAX_COMPUTE_UNITS = 15
        CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
        CL_DEVICE_LOCAL_MEM_SIZE = 49152
      

      Creating an OpenCL context


      Once we have selected the platform and devices, we need to create a context for the devices. Only devices from the same platform can be included in a context. A context is a kind of abstract container that includes the devices, their command queues, memory objects, and the programs built for a particular device. The context is created with the OpenCL function clCreateContext():

          //***************************************************
          // STEP 3: Create a context
          //***************************************************
      
          cl_context context = NULL;
          // Create a context using clCreateContext() and
          // associate it with the devices
          context = clCreateContext(
                                    NULL,
                                    numDevices,
                                    devices,
                                    NULL,
                                    NULL,
                                    &status);
          clerr_chk(status);
      

      Creating command queues


      Now we need to create a command queue for each device separately. The command queue is used to pass commands from the host to the device. Examples of commands are writing to and reading from device memory or loading and running a kernel. A command queue is created with the clCreateCommandQueue() function:

          //***************************************************
          // STEP 4: Create a command queue
          //***************************************************
          cl_command_queue cmdQueue;
          // Create a command queue using clCreateCommandQueue() and
          // associate it with the device you want to execute on.
          // CL_QUEUE_PROFILING_ENABLE allows commands to be timed.
          // Note: since OpenCL 2.0 this call is deprecated in favor of
          // clCreateCommandQueueWithProperties().
          cmdQueue = clCreateCommandQueue(
                                          context,
                                          devices[0],
                                          CL_QUEUE_PROFILING_ENABLE,
                                          &status);
          clerr_chk(status);
      

      The full code from this chapter can be found in the 01-discover-devices folder here.