Unlocking the Secrets of GPU Architecture: Understanding How a GPU Works and Why it’s Superior to a CPU

TJ. Podobnik, @dorkamotorka
7 min read · Apr 12, 2022

In this post, we’ll discuss the architecture of a GPU and how it differs from a traditional CPU. I’ll explain the basic components of a CPU, such as the fetch/decode logic, execution context, ALU, and out-of-order logic, and how they are simplified and scaled up in a GPU. I’ll also describe how the workload is distributed among the compute units in a GPU and provide a real-world example, the NVIDIA GeForce GTX 780. Finally, we’ll compare different generations of NVIDIA GPUs and give an overview of the memory hierarchy in GPU systems.

The post is written in a technical and theoretical style and is aimed at readers who want to understand how GPUs work and how to effectively run programs on them.

Evolution of CPU to GPU

I think it’s good to understand the general CPU architecture first, since a GPU is essentially a simplified and massively scaled-up version of it. The simplified structure of a single-core CPU is composed of the following components:

  • Fetch/Decode logic is responsible for fetching instructions from memory, decoding them to prepare the operands, and selecting the required operation in the ALU
  • Execution context comprises the state of the CPU such as a program counter, a stack pointer, a program-status register, and general-purpose registers
  • ALU (Arithmetic logic unit) is the processing element of the CPU that executes the instruction
  • Out-of-order logic keeps the CPU from sitting idle while waiting for a preceding instruction to complete, by letting it process, in the meantime, subsequent instructions that can run immediately and independently
  • Branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if-then-else structure) will go before this is known definitively
  • Pre-fetcher logic fetches instructions from slower memory into the faster cache memory before they are actually needed
  • Cache memory temporarily stores operands and instructions from memory to significantly speed up processing, since transfers to and from main memory require a considerable amount of power and time
The basic structure of a general-purpose single-core CPU

To build a GPU that comprises thousands of such cores (for parallel processing), we need a design much slimmer than a general-purpose CPU. For this reason, all the complex and large units are removed: the branch predictor, the out-of-order logic, the caches, and the cache pre-fetcher.

The basic structure of a general-purpose single-core GPU

But as we know, a single-core GPU (with one ALU) that can only execute one thread at a time doesn’t make much sense nowadays. One might think: “OK, let’s just replicate this simple structure for as many threads as we wish to have.” In that case, we would get the following structure:

The basic structure of a general-purpose double-core GPU

But if you think about it a bit more carefully, both ALUs will very often execute the same instruction stream, just on different data (e.g., different pixels in the case of image processing). Therefore, we replicate only the ALUs and execution contexts while sharing the fetch/decode logic, and achieve higher performance with a slimmer design. The fetch/decode logic then has some additional work to do, dispatching the correct data (and the correct instructions) to each ALU. Let’s look at an architecture with 8 ALUs and 8 execution contexts:

8 ALUs + 8 execution context architecture

We can also see an additional element called Shared Data (later referred to as local memory), used to efficiently share data among the threads. The image above represents a single-core GPU that can run 8 threads concurrently (one ALU and one execution context per thread). From here on, I will refer to a core as a Compute Unit (CU) and to an ALU as a Processing Element (PE).
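To make the “same instruction stream, different data” idea from the figures above concrete, here is a minimal CUDA sketch: every thread runs the identical kernel code, but on its own pixel. The kernel name brighten and the 8-threads-per-block launch (chosen only to mirror the 8-ALU core in the figure) are illustrative assumptions, not taken from any specific GPU.

```c
#include <cuda_runtime.h>

// Every thread executes the same instruction stream, but on a different pixel:
// the "replicated ALUs, shared fetch/decode" idea in code form.
__global__ void brighten(unsigned char *pixels, int n, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // this thread's pixel index
    if (i < n) {
        int v = pixels[i] + offset;                    // same ADD, different data
        pixels[i] = (unsigned char)(v > 255 ? 255 : v); // clamp to the 8-bit range
    }
}

// Host-side launch, 8 threads per block to mirror the 8-ALU core above:
// brighten<<<(n + 7) / 8, 8>>>(d_pixels, n, 40);
```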

To give you a real-world example, the NVIDIA GeForce GTX 780 GPU contains 2304 processing elements, organized into 12 compute units, which works out to 192 PEs per CU. Each CU also contains 65,536 (64K) 32-bit registers to support large-scale multithreading.

Comparison of NVIDIA GPU generations

We are now a lot closer to understanding all these different terms and why you would even bother with them. Let’s have a look at how threads are scheduled to execute on CUs.

Scheduling threads on Compute units

In the following section, I will also use the term work-item for a thread and work-group for a group of threads, since that is the terminology most commonly used nowadays.

I think it’s better to reason in terms of work-groups rather than individual threads, since a work-group is a group of threads that execute on the SAME compute unit (a work-group cannot be split across CUs). This implies that threads in the same work-group can not only share common local memory, as mentioned before, but can also be synchronized using barriers (more on that, perhaps, later).
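As a rough sketch of both ideas in CUDA terms (where a work-group is a thread block, local memory is __shared__ memory, and the barrier is __syncthreads()), the kernel below stages data in local memory and synchronizes before reading it back. The kernel name reverse_in_block and the fixed block size of 256 are my own assumptions for illustration.

```c
#include <cuda_runtime.h>

// Threads of one block (work-group) cooperate through on-chip shared memory
// (the "Shared Data" / local memory from earlier) and a barrier.
__global__ void reverse_in_block(float *data) {
    __shared__ float tile[256];                // local memory, visible to the whole work-group
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                  // each thread stages one element
    __syncthreads();                           // barrier: wait until every thread has written

    data[base + t] = tile[blockDim.x - 1 - t]; // safe only because of the barrier above
}

// Launch with exactly 256 threads per block to match the tile size, e.g.:
// reverse_in_block<<<num_blocks, 256>>>(d_data);
```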

In modern GPUs, the compute unit schedules and executes work-items from the same work-group in groups of 32 parallel work-items called warps. I’m not the right person to explain why exactly 32 threads are grouped together, but the implication is that each core (in an NVIDIA GPU) has a multiple of 32 ALUs, which is also why we allocate a multiple of 32 threads per work-group to load the GPU optimally.
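Here is a quick host-side sketch of what that means for the launch configuration; the kernel name my_kernel and the concrete numbers are only illustrative.

```c
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float *data, int n) {  // hypothetical kernel; the launch config is the point
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

int main() {
    int n = 1 << 20;                             // ~1M work-items
    int threadsPerBlock = 256;                   // 256 = 8 full warps of 32 threads
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256)

    // A block size of, say, 200 would leave each block's last warp with only
    // 8 active lanes (200 = 6*32 + 8), wasting 24 of its 32 slots.
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    my_kernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
    return 0;
}
```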

When a compute unit is given one or more work-groups to execute, it partitions them into warps, and each warp is scheduled for execution by a warp scheduler. The important thing to realize here is that a warp executes one instruction at a time across its 32 threads. Consequently, the best performance is achieved when all work-items in a warp execute the same instruction; if different work-items need to execute different instructions, some threads stay idle, because each instruction is issued to all 32 threads at once.

For example, suppose an if-else statement is executed by all 32 threads of a warp, and 22 threads evaluate the condition to True while the remaining 10 evaluate it to False. First, the 22 threads that evaluated to True execute while the other 10 stay idle; only then do the 10 threads that evaluated to False execute, leaving the 22 idle. So the if-else statement is executed in two steps.
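Here is what such a divergent branch looks like in a CUDA kernel (a sketch; the kernel name and the numbers in the comments are illustrative). The two sides of the branch cannot run simultaneously within one warp, so the hardware serializes them into two passes.

```c
#include <cuda_runtime.h>

// Warp divergence: within a warp of 32 threads, the two branch directions
// are executed one after the other, with the "other" threads masked off.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (x[i] > 0.0f) {        // pass 1: e.g. 22 of the 32 lanes take this side,
        x[i] = x[i] * 2.0f;   //         while the other 10 sit idle
    } else {                  // pass 2: the remaining 10 lanes execute,
        x[i] = 0.0f;          //         while the first 22 now idle
    }
}

// If all 32 threads of a warp happen to take the same side (e.g. the branch
// depends only on blockIdx.x), there is no divergence and only one pass.
```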

The following picture summarizes all the terms and hopefully gives you a better picture of what’s going on:

If you haven’t figured it out yourself by now: GPUs are extremely good when you have to apply the same operation to a large amount of data, since they allow you to do that in parallel. A CPU, on the other hand, might still perform better if your code contains a lot of (if-else) branching, where the GPU would suffer from idle threads.

I have mentioned memory, registers, local memory, and cache all over the place, but I haven’t really talked about the impact they have on program execution efficiency.

Memory hierarchy on GPU

In general, the memory of the GPU is composed of five regions accessible from each thread (a short kernel sketch after the list shows how they map to code):

  • Registers — Evenly distributed between the threads of all the work-groups on a single CU (e.g., if you have 256 work-items per work-group and there are 4 work-groups, then there are 65,536 / (256 × 4) = 64 registers per work-item on a CU).
  • Local Memory — This memory is mainly used for data interchange within a work-group
  • Global Memory — better known as the graphics card’s RAM. It also includes two additional read-only memory spaces: Texture Memory and Constant Memory. Constant memory stores constants and program arguments, while texture memory is optimized for 2D data such as images.
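A minimal CUDA kernel that touches each of these regions might look like the sketch below; the kernel name blend, the constant scale, and the block size of 128 are hypothetical.

```c
#include <cuda_runtime.h>

__constant__ float scale;                          // constant memory: a read-only program argument

__global__ void blend(const float *in, float *out, int n) {  // in/out live in global memory (GPU RAM)
    __shared__ float tile[128];                    // local memory: one copy per work-group
    int i = blockIdx.x * blockDim.x + threadIdx.x; // i is held in a register, private to this thread

    if (i < n) tile[threadIdx.x] = in[i];          // global memory -> local memory
    __syncthreads();                               // barrier: the whole work-group reaches this point

    if (i < n) out[i] = tile[threadIdx.x] * scale; // local + constant memory -> register -> global
}

// Host side (assuming 128 threads per block):
//   cudaMemcpyToSymbol(scale, &host_scale, sizeof(float));
//   blend<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
// Texture memory would be accessed through the separate texture-object API.
```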

Their importance lies in the access time, i.e., the delay to read from or write to them. As we will see later, being able to distinguish between memory levels can significantly improve the speed of program execution.

Access time by memory level

In this post, we looked at the inner workings of a GPU and how it differs from a traditional CPU. We covered the structure and components of a GPU, including compute units and processing elements, how work-groups and work-items are scheduled onto CUs, the different types of memory present on a modern GPU and their hierarchy, and the programmer’s role in partitioning programs into work-groups and work-items. I hope this gives you a solid overview of GPU architecture and a good starting point for effectively programming and running programs on a GPU.

Thanks for reading! :) If you enjoyed this article, hit that clap button below 👏

Would mean a lot to me and it helps other people see the story. Say Hello on Instagram | Linkedin | Twitter

Do you want to start using Medium? Use this referral link 🔗

If you liked my post you can buy me a Hot dog 🌭

Follow me for more related content, cheers!
