How 25 NOPs fixed a 10-year-old GTA V performance issue on AMD
A years-old GTA V performance issue (FPS tanking at night when looking at the city), traced to six prefetchnta hints, two of which target a write-combine vertex buffer and stall the AMD Zen pipeline. The final fix is 25 NOPs in two places.
If you've played GTA V on PC you've probably seen it: stand on Mount Chiliad at night, point the camera at Los Santos, and watch your frame rate fall off a cliff. People have been complaining since at least 2019 on /r/GrandTheftAutoV_PC.
The real answer is two patterns of six bytes each, in a single function, on AMD CPUs.
This is the story of finding it, understanding why it hits AMD so much harder, and the 25-byte fix, which I worked on together with @divocbn.
The known unknown
The bug is famous enough to have folklore. Take a static scene: no chases, no NPCs, no rain. Turn the camera toward downtown Los Santos around 2 AM and your CPU frame time roughly doubles. AMD owners get hit harder than Intel.
It's been like this since the original PC port. Forum threads, Reddit posts, "fixes" involving regedits and driver tweaks: none of it actually addresses the cause, because the cause is not in any .ini file.
The thing to hold on to: both vendors slow down at night, but AMD users slow down much harder. That asymmetry is a clue.
What's actually rendering at night
Up close, GTA V renders street lights as proper deferred light sources (volumetrics, shadowing, the usual modern lighting cost). That's untenable for a panoramic view of Los Santos at night, where you can see thousands of points of light.
Beyond a distance threshold, lights switch to a cheap representation: a billboarded sprite per light, batched into a vertex buffer and submitted in a few draw calls. Coronas, in old engine terminology. These are the distant LOD lights, and they're rendered by a single function: RenderDistantLODLights.
During the day, or out in the countryside, that function does almost nothing. Looking at the city at night, it iterates over every light bucket within draw distance, culls each one, writes each survivor's position and color into a vertex buffer, and dispatches a draw. Standard CPU-side scene work. The cost should be low. It wasn't.
[Interactive before/after comparison: same camera, same time of day, only RenderDistantLODLights toggled off and on.]
And it's not just the main camera pass. RenderDistantLODLights takes a render-mode parameter, and the engine calls it once per render phase that needs distant lights:
void RenderDistantLODLights(uint32_t mode, float intensity)
{
    // ...
    if (mode == 0x2) // WATER_REFLECTION
    {
        AdjustDistantLights(intensity);
        if (intensity <= 0.0f) return;
    }
    // ...
}

The five modes are DEFAULT, MIRROR_REFLECTION, WATER_REFLECTION, CUBE_REFLECTION, and SEETHROUGH. So when you look at the LS coast at night, or any scene with reflective water, the function runs twice: once for what you see directly, once for what the water reflects. Mirrors and cube-mapped surfaces add their own passes on top.
The prefetch stalls happen in every pass, doubling or tripling the cost in exactly the scenes that already have the most lights on screen.


Locating the cost
Profiling GTA5 at night, looking at the LS skyline, the sampler kept pointing at the same address.
The top entry, GTA5 : 0x1405909D8, which the disassembly identifies as CLODLights::RenderDistantLODLights, sits at 29.1% of all samples in the frame, more than the next four functions combined. In wall-clock terms, around 1.6 ms of CPU per frame. Enormous for a function whose job is just to build a vertex buffer.
We pulled it up in a disassembler, plain x86-64 in GTA5.exe, and walked through the body. The structure is simple: two inner loops, one for street lights and one for everything else, each iterating over light entries and writing a position+color pair into a vertex buffer.
Six prefetchnta instructions stood out. Three at the top of each inner loop, all back-to-back, all targeting addresses derived from the vertex buffer pointer and the per-light position/color arrays:
PrefetchDC(pRGBIPrefetch + j);     // per-light color array (cacheable)
PrefetchDC(pPositionPrefetch + j); // per-light position array (cacheable)
PrefetchDC(pOutputPrefetch + j);   // D3D vertex buffer slot (write-combine)

Standard prefetching pattern, the kind of thing nobody flags in a hot inner loop. Each hint looks ahead to entry j + 4: 16 bytes for color, 48 bytes for position, 64 bytes for the output. The two array prefetches look reasonable enough; the third targets an address that, if you know D3D11 memory semantics, you immediately recognize as suspicious.
We tested the simplest possible intervention: NOP the six prefetches in the stock binary, change nothing else. Night FPS in the city went from ~140 to ~240 on AMD. The 1.6 ms cost dropped to ~0.17 ms.
Six hint instructions. Nearly ten times the budget. That's what we needed to explain.
Root cause: prefetching write-combine memory
Here's why those prefetches are catastrophic.
What prefetchnta is supposed to do
prefetchnta is a hint to the CPU: "I'm going to read this address soon, please pull it into a non-temporal cache structure so the eventual load doesn't have to wait on DRAM." Intel docs call it a hint into the "non-temporal" cache level (typically L1 with a marker not to pollute higher caches); AMD has a similar idea with different hierarchy details. The contract is that the CPU is free to ignore it. It's a hint, not a load.
In raw bytes:
41 0F 18 44 8B XX ; prefetchnta byte ptr [r11 + rcx*4 + disp8]

0F 18 is the opcode shared by the prefetch instructions; the /0 in 0F 18 /0 means the reg field of ModR/M is zero, which selects the nta variant specifically. 41 is the REX.B prefix that extends the base register into the r8-r15 range.
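To make the byte breakdown above concrete, here is a minimal sketch that splits the ModR/M and SIB bytes of that one encoding into their fields. The struct and the `decodePrefetch` helper are hypothetical names made up for illustration; this handles only this instruction shape and is not a general x86 decoder.

```cpp
#include <cstdint>

// Fields of the six-byte encoding 41 0F 18 44 8B XX, split out by hand.
struct PrefetchFields {
    bool    rexB;   // REX.B: extends the SIB base register into r8-r15
    uint8_t mod;    // ModR/M bits 7-6 (1 = an 8-bit displacement follows)
    uint8_t reg;    // ModR/M bits 5-3 (the /digit; 0 selects prefetchnta)
    uint8_t rm;     // ModR/M bits 2-0 (4 = a SIB byte follows)
    uint8_t scale;  // SIB bits 7-6, expressed as a multiplier: 1, 2, 4 or 8
    uint8_t index;  // SIB bits 5-3 (1 = rcx; REX.X is clear in 0x41)
    uint8_t base;   // SIB bits 2-0 with REX.B as bit 3 (11 = r11)
};

// Hypothetical helper: expects REX, 0F, 18, ModR/M, SIB at p[0]..p[4].
inline PrefetchFields decodePrefetch(const uint8_t* p) {
    const uint8_t rex = p[0], modrm = p[3], sib = p[4];
    PrefetchFields f{};
    f.rexB  = (rex & 0x01) != 0;
    f.mod   = static_cast<uint8_t>(modrm >> 6);
    f.reg   = static_cast<uint8_t>((modrm >> 3) & 7);
    f.rm    = static_cast<uint8_t>(modrm & 7);
    f.scale = static_cast<uint8_t>(1u << (sib >> 6));
    f.index = static_cast<uint8_t>((sib >> 3) & 7);
    f.base  = static_cast<uint8_t>((sib & 7) | (f.rexB ? 8 : 0));
    return f;
}
```

Fed the bytes 41 0F 18 44 8B, the fields come out as reg 0 (the nta variant), scale 4, index 1 (rcx) and base 11 (r11), matching the disassembly.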
What write-combine memory is
Vertex buffers in Direct3D 11, including the one this function writes into, are typically mapped as write-combine (WC) memory. WC is a memory type the CPU sets up so that writes to that region don't go through the normal cache hierarchy. Instead they collect inside a small set of dedicated buffers (the WC buffers) on the core, and when a buffer fills or is forced to drain, the whole thing flushes to the memory bus as a single wide burst write.
This is exactly what you want for vertex buffer uploads: you're writing tens of kilobytes the GPU will read once, never to be touched again. Caching it would pollute L1/L2/L3 with one-shot data. Write-combining bypasses the cache entirely and turns many small CPU writes into a few wide bus transactions.
The key property: WC memory is not cached, by design.
The conflict
In the inner loop of RenderDistantLODLights, the code is writing light entries into the vertex buffer (WC memory) and, between iterations, prefetching ahead in that same buffer:
_mm_prefetch((char*)(pOutputPrefetch + j), _MM_HINT_NTA);
// pOutputPrefetch -> Map()'d D3D vertex buffer (WC)

The instruction is asking the CPU to cache something the system has explicitly marked uncacheable. Both architectures get the exact same instruction; they don't get the same behavior.
Intel documents this directly. The Intel SDM (Vol. 2B, PREFETCHh) states that prefetch hints to UC and WC memory regions are dropped:
"Prefetches from uncacheable or WC memory are ignored." (Intel SDM, PREFETCHh)
The hint gets recognized early in the pipeline, never issues against the memory subsystem, and has no observable cost. The developer who wrote this code almost certainly tested on Intel and saw a clean profile.
AMD Zen takes the hint at face value. The core issues the fetch into the L1D pipeline, and because the page attributes classify the address as write-combining, the load can't be satisfied from cache: by definition, there is no cached copy. Before the prefetch can be discarded (or filled, or anything), it has to be reconciled against the write-combining buffers, the small set of per-core partial-line buffers that hold pending CPU writes destined for the WC region. Zen needs to determine whether the prefetched line overlaps any in-flight WC buffer to preserve memory ordering. That reconciliation isn't free; it serializes that lane of the pipeline until the WCB state is resolved.
The instruction is still a hint in the architectural sense: it doesn't trap, doesn't fault, doesn't read incorrectly. But behavior the architecture permits still translates, on Zen, to a multi-cycle stall in practice, on every single iteration of a tight loop.
This is also asymmetric in a way the code never expected: on the WC output buffer the loop is concurrently writing to, the prefetch and the actual store path are now competing for the same WCB tracking resources, which is exactly the worst-case interaction for that subsystem.
Multiply that by the inner-loop iterations × the light buckets × the render passes active in the frame (default, plus water reflection, mirror, and cube where the scene has them), and the stalls compound into the 1.6 ms per-frame cost the profiler kept pointing at.
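Because prefetchnta is only a hint, dropping it cannot change what the loop writes. A minimal sketch of that property follows, under loud assumptions: the `Position` and `Vertex` layouts and both function names are hypothetical, and the buffers are ordinary heap memory, so this does not reproduce the WC stall. It only shows the loop shape with and without the hints.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)0)  // no SSE available: the hint simply disappears
#endif

struct Position { float x, y, z; };                 // assumed 12-byte layout
struct Vertex   { float x, y, z; uint32_t rgbi; };  // assumed 16-byte layout

// Loop shape with three hints per iteration. rgbi must hold at least
// pos.size() entries. The hints stay in bounds here; the game looks further ahead.
std::vector<Vertex> buildWithPrefetch(const std::vector<Position>& pos,
                                      const std::vector<uint32_t>& rgbi) {
    std::vector<Vertex> out(pos.size());
    for (std::size_t j = 0; j < pos.size(); ++j) {
        PREFETCH_NTA(rgbi.data() + j);
        PREFETCH_NTA(pos.data() + j);
        PREFETCH_NTA(out.data() + j);  // in the game, this one targets WC memory
        out[j] = Vertex{pos[j].x, pos[j].y, pos[j].z, rgbi[j]};
    }
    return out;
}

// Same loop with the hints removed, which is what the NOP patch amounts to.
std::vector<Vertex> buildWithoutPrefetch(const std::vector<Position>& pos,
                                         const std::vector<uint32_t>& rgbi) {
    std::vector<Vertex> out(pos.size());
    for (std::size_t j = 0; j < pos.size(); ++j)
        out[j] = Vertex{pos[j].x, pos[j].y, pos[j].z, rgbi[j]};
    return out;
}
```

Both functions return identical vertex data for the same input; only the timing can differ, and only on memory where the hint is expensive.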
Why the other prefetches are also wrong
There are six prefetchnta instructions in the function: three per inner loop, and two inner loops (one for street lights, one for non-street lights). The biggest offender is the WC-targeted output prefetch. The other two per loop target the per-light position and color arrays. Those are normal cacheable memory, so they don't trigger the WC stall, but they're still useless:
- The lookahead distances are 16 and 48 bytes, both shorter than one 64-byte cache line, so the hint lands on the line the loop is already reading or, at most, the one right after it.
- If the line is already in cache, the prefetch is a no-op.
- If it isn't, DRAM latency is ~200–300 cycles and the consumer load arrives in ~10. The prefetch has no time to hide latency.
They contribute nothing. NOPing them removes a small overhead and removes any ambiguity about their interaction with the rest of the pipeline.
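The cache-line argument is plain integer arithmetic. A small sketch, assuming 64-byte lines and entry strides of 4 bytes (color) and 12 bytes (position); those strides are inferred from the 16 and 48 byte figures, and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // line size on current x86-64 cores

// Byte distance of a lookahead of `entries` elements of `stride` bytes each.
constexpr std::size_t lookaheadBytes(std::size_t entries, std::size_t stride) {
    return entries * stride;
}

// True when `addr` and `addr + lookahead` fall in the same cache line.
constexpr bool sameCacheLine(std::uintptr_t addr, std::size_t lookahead) {
    return addr / kCacheLine == (addr + lookahead) / kCacheLine;
}

// Number of cache lines between the current address and the prefetched one.
constexpr std::size_t linesAhead(std::uintptr_t addr, std::size_t lookahead) {
    return (addr + lookahead) / kCacheLine - addr / kCacheLine;
}

static_assert(lookaheadBytes(4, 4)  == 16, "color: 4 entries x 4 bytes (assumed stride)");
static_assert(lookaheadBytes(4, 12) == 48, "position: 4 entries x 12 bytes (assumed stride)");
```

For any lookahead below 64 bytes, `linesAhead` is 0 or 1 depending on where the current entry sits within its line, so the hint never reaches further than the adjacent line.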
The fix
The whole patch is twelve lines of C++. It runs once at game startup, pattern-scans the binary for the two prefetch sequences, and overwrites 25 bytes at each site with nop:
#include "StdInc.h"
#include "Hooking.Patterns.h"

static HookFunction initFunction([]
{
    // Street-light loop: movsxd rcx, ebx followed by the first prefetchnta.
    auto alphaStreetPrefetch = hook::get_pattern("48 63 CB 41 0F 18 44 8B");
    hook::nop(alphaStreetPrefetch, 25);

    // Non-street-light loop: movsxd rcx, edi followed by the first prefetchnta.
    auto alphaNormalPrefetch = hook::get_pattern("48 63 CF 41 0F 18 44 8B");
    hook::nop(alphaNormalPrefetch, 25);
});

Both patterns end in 41 0F 18 44 8B, the start of the first prefetchnta (REX.B prefix, opcode, ModR/M and SIB bytes), preceded by the movsxd rcx, ebx (48 63 CB) or movsxd rcx, edi (48 63 CF) that prepares the index for the prefetch's SIB byte. That sequence is unique enough across the binary that the pattern resolver returns exactly one match per loop.
The 25 bytes per pattern cover the three back-to-back prefetchntas plus their setup loads. NOPing exactly that range preserves the surrounding instruction boundaries: the function's prologue, epilogue, and loop control are untouched. The CPU executes through 25 nops and continues into the original code path, which then runs without the WC stall.
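The scan-and-overwrite step can be sketched on a plain byte buffer. This is not how hook::get_pattern or hook::nop are implemented; `parsePattern`, `findUnique` and `nopRange` are hypothetical helpers, the buffer is a std::vector rather than live process memory, and the fill byte is assumed to be the one-byte NOP, 0x90.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Parse "48 63 CB ?? 0F" into bytes; "?" or "??" marks a wildcard (nullopt).
std::vector<std::optional<uint8_t>> parsePattern(const std::string& text) {
    std::vector<std::optional<uint8_t>> out;
    std::istringstream in(text);
    std::string tok;
    while (in >> tok) {
        if (tok[0] == '?') out.push_back(std::nullopt);
        else out.push_back(static_cast<uint8_t>(std::stoul(tok, nullptr, 16)));
    }
    return out;
}

// Return the offset of the only match, or nullopt when there are zero or several.
std::optional<std::size_t> findUnique(const std::vector<uint8_t>& image,
                                      const std::string& text) {
    const auto pat = parsePattern(text);
    std::optional<std::size_t> hit;
    if (pat.empty() || pat.size() > image.size()) return hit;
    for (std::size_t i = 0; i + pat.size() <= image.size(); ++i) {
        bool match = true;
        for (std::size_t k = 0; k < pat.size() && match; ++k)
            match = !pat[k] || *pat[k] == image[i + k];
        if (!match) continue;
        if (hit) return std::nullopt;  // second match: the pattern is ambiguous
        hit = i;
    }
    return hit;
}

// Overwrite `count` bytes at `offset` with 0x90; refuse ranges that run off the end.
bool nopRange(std::vector<uint8_t>& image, std::size_t offset, std::size_t count) {
    if (offset > image.size() || count > image.size() - offset) return false;
    for (std::size_t k = 0; k < count; ++k) image[offset + k] = 0x90;
    return true;
}
```

Insisting on a unique match is the design choice that matters: if a game update moves or duplicates the sequence, the sketch returns nothing instead of patching the wrong 25 bytes.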
Results
Measured against the stock binary on hardware that hits the bug clearly:
| Metric | Before | After |
|---|---|---|
| CPU time in function (avg) | 1.6 ms | 0.17 ms |
| Night FPS in the city, AMD | ~140 | ~240 |
| Night FPS in the city, Intel | baseline | +10–40 FPS |
The AMD numbers are the dramatic ones because that's where the WC prefetch stall hits. Intel users still gain, just less spectacularly, because the function does strictly less work. Video here.
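FPS figures are easier to compare against a per-function cost once converted to frame times. A short sketch of that arithmetic, taking the table's numbers as given; the helper names are hypothetical.

```cpp
// Frame time in milliseconds for a given frame rate.
constexpr double frameTimeMs(double fps) { return 1000.0 / fps; }

// Relative gain from `before` to `after`, in percent.
constexpr double percentGain(double before, double after) {
    return (after - before) / before * 100.0;
}

// How many times smaller `after` is than `before`.
constexpr double reductionFactor(double before, double after) {
    return before / after;
}
```

With the AMD row, `frameTimeMs(140)` is about 7.1 ms and `frameTimeMs(240)` about 4.2 ms, `percentGain(140, 240)` is about 71%, and `reductionFactor(1.6, 0.17)` is about 9.4 for the function itself.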
The patch shipped in FiveM PR #3984 (commit 770a0a7) and is now live in the FiveM client, so every FiveM player on AMD picked up a free 30%+ at night.
Thanks
- @divocbn for the contribution analysis and for helping to make the patch.
- @radium_cfx for the guide to prefetch functions.
- @gogsi for testing the patch on a wide range of hardware.