The Power of CPU Cache: Small in Size, Big on Speed

This article is about the CPU cache: what it is, what types of cache exist, and how it maps data from main memory.

Imagine you’re playing a video game where the screen is filled with hundreds of enemies, items, and other game objects, each moving, animating, and reacting to the environment in real time. Behind the scenes, the game needs to update the state of each of those entities every frame, usually 60 times per second.

Before the mid-2000s, most games used Object-Oriented Design to define the architecture of the game. In this design, every character and item encapsulates its internal properties and keeps them close in memory. However, these game entities are not stored next to each other; they are scattered throughout memory.

In most cases, to loop through the list of game entities and update them, the CPU has to retrieve data from many different memory locations. This repeated jumping around main memory, dozens of times per second, is very inefficient and degrades performance.

To overcome this inefficiency in modern games, developers often group similar data components of these entities together in memory, rather than following a traditional object-oriented approach. For example, instead of having each character store all its data in a single object, a game might store all positions in one continuous block of memory, all velocities in another, and so on.

This layout, known as Data-Oriented Design, allows the game to iterate over and update all similar properties at once. Modern games use an Entity-Component System to manage individual components of entities and group them together in memory.
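
To make this concrete, here is a minimal C++ sketch contrasting the two layouts. The names ('Entity', 'World', 'update_positions') are made up for illustration and not taken from any particular engine:

```cpp
#include <vector>

// Object-oriented style: each entity owns all of its fields.
// Entities allocated individually on the heap end up scattered in memory.
struct Entity {
    float px, py;   // position
    float vx, vy;   // velocity
    int   health;
    // ... sprites, AI state, and so on
};

// Data-oriented style: one contiguous array per component.
// Updating all positions touches one tightly packed block of memory.
struct World {
    std::vector<float> px, py;  // all positions, back to back
    std::vector<float> vx, vy;  // all velocities, back to back

    void update_positions(float dt) {
        for (std::size_t i = 0; i < px.size(); ++i) {
            px[i] += vx[i] * dt;   // sequential access is cache friendly
            py[i] += vy[i] * dt;
        }
    }
};
```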

Now, why am I telling you all of this? Because using the Data-Oriented approach leverages the CPU cache to its fullest potential.

CPU Cache Basics

The CPU cache is a small, extremely fast memory component that stores copies of the data the CPU is likely to use next. Instead of accessing the main memory, which is very slow, the CPU can copy chunks of data from the main memory to its cache, read and write the data in the cache, and update the main memory later if the data in the cache was modified.

Modern CPUs usually have multiple levels of cache: L1, L2, and L3, with varying sizes and speeds, where L1 is the smallest and fastest, and L3 is larger but slower. These caches are found either inside the CPU package or very close to it. For simplicity, I will talk about the concept of cache in general, rather than getting into the specifics of cache architecture and hierarchies. Let’s see how the cache works in action.

Back to our game example, every time the game needs to update the position of all characters, it has to loop through each of the position values and update them based on some physics formula. The first thing the CPU does when asked to read from memory is to check whether this memory address is already in the cache. In our example, the first position value is not in the cache because the CPU hasn’t fetched it from the main memory yet. This is known as a Cache Miss.

In this case, the data from the requested address will be copied from the main memory to the cache and updated from there. In addition to the requested data, the addresses following the requested one will also be copied from the main memory to the CPU cache because programs usually access data in memory sequentially. When the game tries to update the next position value, it will already be in the cache, so the CPU can directly update the value in the cache and avoid accessing the slow main memory. This is known as a Cache Hit.

When data is stored contiguously in memory, the likelihood of cache hits increases, and the CPU spends less time waiting for data from the slower main memory, which improves performance and makes the program run faster.
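
To get a feel for the difference, here is a small, illustrative microbenchmark. It is a sketch rather than a rigorous measurement (results vary by machine, compiler, and optimization level), but on typical hardware the contiguous traversal is noticeably faster than chasing scattered pointers:

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <vector>

int main() {
    constexpr std::size_t N = 1'000'000;

    // Contiguous: values laid out back to back, so one cache line
    // fetch brings in several neighbors (many cache hits).
    std::vector<float> contiguous(N, 1.0f);

    // Scattered: each value lives in its own heap allocation, so the
    // loop keeps jumping to unrelated addresses (many cache misses).
    std::vector<std::unique_ptr<float>> scattered;
    scattered.reserve(N);
    for (std::size_t i = 0; i < N; ++i)
        scattered.push_back(std::make_unique<float>(1.0f));

    auto time_ms = [](auto&& body) {
        auto t0 = std::chrono::steady_clock::now();
        body();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    float sum = 0.0f;
    double t_contiguous = time_ms([&] { for (float v : contiguous) sum += v; });
    double t_scattered  = time_ms([&] { for (auto& p : scattered)  sum += *p; });

    std::cout << "contiguous: " << t_contiguous << " ms\n"
              << "scattered:  " << t_scattered  << " ms\n"
              << "(sum = " << sum << ")\n";  // print sum so it is not optimized away
}
```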

Spatial and Temporal Locality

The efficiency of cache usage is heavily influenced by the ‘Locality of Reference’ principle, which has two key components: Temporal Locality and Spatial Locality.

  1. Temporal Locality suggests that recently accessed data is likely to be accessed again in the near future. This principle refers to the likelihood that if a program accesses a particular memory location, it will access the same memory location again soon. In our video game example, if the position of a character is updated every frame, the position data will be accessed repeatedly in a short period of time. By keeping frequently accessed data in the cache, the CPU can quickly access this data without needing to reload it from slower main memory, thus improving performance.
  2. Spatial Locality suggests that data physically close in memory tends to be accessed around the same time. When a program accesses a specific memory location, it is likely to access nearby memory locations shortly after. In our example, when the game updates the position of one character, it will usually update the positions of all adjacent characters stored contiguously in memory. By organizing data contiguously, the game ensures that when one piece of data is loaded into the cache, the adjacent data is also loaded, leading to fewer cache misses and more efficient memory access, as the sketch after this list illustrates.
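
A classic way to watch spatial locality at work is traversing a 2D grid stored in a flat array. Visiting it row by row walks memory sequentially, while visiting it column by column jumps a full row’s worth of elements on every step. A minimal sketch, with made-up function names:

```cpp
#include <vector>

// Row-major traversal visits adjacent elements, so each cache line
// fetched from main memory is fully used before moving on.
long long sum_rows(const std::vector<int>& grid, int rows, int cols) {
    long long sum = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            sum += grid[r * cols + c];   // neighbors share a cache line
    return sum;
}

// Column-major traversal strides 'cols' elements at a time, touching a
// different cache line on almost every access for large grids.
long long sum_cols(const std::vector<int>& grid, int rows, int cols) {
    long long sum = 0;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            sum += grid[r * cols + c];   // long jumps defeat spatial locality
    return sum;
}
```

Both functions compute the same result; only the access pattern differs, and for large grids that difference alone can change the running time several-fold.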

Cache Memory Structure

Let’s talk about the cache memory structure. There are several types of cache components, the most notable of which are the instruction cache and the data cache. The instruction cache is a small hardware component that accelerates the fetching and execution of CPU instructions, while the data cache speeds up data access. For this explanation, I will focus only on the data cache.

A typical cache organizes its data into lines, which are the smallest chunks of memory the cache can transfer. A cache line is usually 32, 64, or 128 bytes, depending on the architecture, with 64-byte lines being the most common choice. The data is also tagged with a number representing the general location in main memory from which the data came.
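
If you are curious about the cache line size on your own machine, you can query it. The sketch below assumes Linux with glibc for the sysconf call; the C++17 interference-size constant is only a compile-time hint and is not implemented by every standard library:

```cpp
#include <iostream>
#include <new>       // std::hardware_destructive_interference_size (C++17)
#include <unistd.h>  // sysconf (the cache query is a glibc extension)

int main() {
    // Runtime query of the L1 data cache line size (Linux/glibc only).
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::cout << "L1 data cache line: " << line_size << " bytes\n";

    // Portable compile-time hint from the standard library.
    std::cout << "interference size hint: "
              << std::hardware_destructive_interference_size << " bytes\n";
}
```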

To accelerate the process of finding data in the cache, it is divided into sets, where each set contains multiple cache lines. The collection formed by taking one line slot from every set is known as a cache bank, or a ‘way’. This type of cache is called an N-way set associative cache; for example, a cache in which each set holds 8 lines is described as an “8-way set associative” cache.

Let’s consider a computer that can handle up to 64 GB of RAM and see how memory addresses are mapped into a 32 KB cache. Let’s assume it has a cache with 64 sets and 8 ways, where each cache line contains 64 bytes of data.

To copy data from the main memory to the cache, the CPU sends the cache 64 bytes of data read from the main memory, along with the address from which this data came. Since the address space is 64 GB, it requires 36 bits to represent each address in the main memory. These 36 bits are divided into 3 sections:

  • 24 bits that represent the tag of the data
  • 6 bits for the set index, which tells the cache in which set this data is found
  • 6 bits for the offset within a single cache line to find the requested byte

When the cache receives the address, it extracts the 6 bits of the ‘set index’ and places the data in the next available cache line in that set. In addition, it copies the 24-bit tag associated with that data.

When the CPU wants to read data from the cache, it sends the requested address to the cache, which performs the same process in reverse: it extracts the ‘set index’ from the given address, compares the address’s 24-bit tag against the tags stored in that set, and sends back the data from the matching line. The lowest 6 bits of the address, which represent the offset within the cache line, are then used to extract the exact byte requested.
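
The bit manipulation involved is simple. Here is a sketch that decodes an address into tag, set index, and offset for the 64-set, 64-byte-line cache described above (the constants and names are just for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// 64-byte lines -> 6 offset bits; 64 sets -> 6 index bits; the rest is the tag.
constexpr unsigned OFFSET_BITS = 6;
constexpr unsigned INDEX_BITS  = 6;

struct CacheAddress {
    std::uint64_t tag;
    std::uint64_t set_index;
    std::uint64_t offset;
};

CacheAddress decode(std::uint64_t addr) {
    CacheAddress a;
    a.offset    =  addr & ((1u << OFFSET_BITS) - 1);                // lowest 6 bits
    a.set_index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); // next 6 bits
    a.tag       =  addr >> (OFFSET_BITS + INDEX_BITS);              // remaining 24 bits
    return a;
}

int main() {
    CacheAddress a = decode(0x3ABCDEF42);  // an arbitrary 36-bit address
    std::printf("tag=%#llx set=%llu offset=%llu\n",
                (unsigned long long)a.tag,
                (unsigned long long)a.set_index,
                (unsigned long long)a.offset);
}
```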

Cache Associativity

In the example above, every memory address is mapped to its specific set in the cache but still has flexibility regarding which line to go to within that set. However, in an extreme case, a cache can have only one set, which means data from main memory can go anywhere in the cache. This type of cache is called Fully Associative. It is very flexible, but it requires more power and chip area to retrieve data.

On the other extreme, a cache set can have a single ‘way’, which means every main memory address can only be mapped to one specific location in the cache. This implementation is known as Direct-mapped cache. In this case, copying data into the cache or retrieving it is very fast, but it suffers from many conflict misses, which occur when the designated cache line is already occupied and must be evicted to make room for the new data.

In reality, most cache implementations are a compromise between those two extremes.
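
To see why a direct-mapped cache suffers conflict misses, consider this toy simulation, which mirrors the 64-set example above but with a single way per set (the structure is simplified for illustration). The two addresses below are exactly 4 KiB apart, so they map to the same set and keep evicting each other; every access is a miss, even though the rest of the cache sits empty:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Toy direct-mapped cache: 64 sets, one line per set, 64-byte lines.
struct DirectMappedCache {
    struct Line { bool valid = false; std::uint64_t tag = 0; };
    std::vector<Line> lines = std::vector<Line>(64);  // one line per set

    bool access(std::uint64_t addr) {
        std::uint64_t index = (addr >> 6) % 64;  // set index bits
        std::uint64_t tag   =  addr >> 12;       // everything above them
        Line& line = lines[index];
        if (line.valid && line.tag == tag) return true;  // hit
        line.valid = true;                               // miss: evict and refill
        line.tag   = tag;
        return false;
    }
};

int main() {
    DirectMappedCache cache;
    // 0x0000 and 0x1000 are 4 KiB apart: same set index, different tags.
    for (int i = 0; i < 4; ++i) {
        std::cout << "0x0000: " << (cache.access(0x0000) ? "hit" : "miss") << '\n';
        std::cout << "0x1000: " << (cache.access(0x1000) ? "hit" : "miss") << '\n';
    }
}
```

In the 8-way set associative cache from the earlier example, both addresses would simply occupy two different ways of the same set, and every access after the first would be a hit.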

Summary

The CPU cache is a small, fast buffer between the processor and main memory, organized into lines and sets, and it rewards programs that access memory contiguously and repeatedly. Laying out data with these properties in mind, as Data-Oriented Design does, is one of the simplest ways to make a program dramatically faster.

Want to learn more about computer science and game development? Check out the Night Quest Games Blog for more great articles like this one!
