Linux latency measurements and compositor tuning

Ever since moving back from Windows, I’ve been paranoid about latency in games on Linux. Slight changes to the environment or settings can, all of a sudden, make the mouse feel very floaty. There have been many community discussions on this topic and I’m certainly not alone in this.

To investigate, I used a small Teensy microcontroller to measure click-to-photon latency. It acts as a USB HID mouse and is paired with a light sensor pressed against the screen. I flashed it with an existing open-source LDAT sketch, with slight modifications. The resulting setup can log hundreds of samples to a CSV file, unattended.

Measurements were done on two computers: a desktop and a laptop. They each have an Ada-generation RTX card and a Zen 4 CPU. I used virtually the same NixOS config on both, along with an up-to-date Windows 11 install on each. They were connected to the same display for most of the tests: an LG C1 at 120 Hz over HDMI. I have Radeon GPUs lying around, and plan on testing gamescope on them in particular, but that’s going to have to wait until the next batch of tests. App settings were selected to avoid hardware bottlenecks. My goal was to easily hit 120 FPS on a 120 Hz output and test for any queueing effects in the software stack.

I used KDE Plasma 6.6.4 on Wayland, Proton-GE 10-33, MangoHud 0.8.2 for FPS limiting (using the late method), and Nvidia driver 595.58.03. Originally, I had meant to compare with X11 sessions as well, but with KDE removing them soon, I dropped that plan. On Windows, I used either the Nvidia control panel or RTSS for frame-rate limiting, interchangeably.

Despite the automated nature of the tool, launching & cataloging the runs still ends up being a lot of work. Controlling all the variables is a major pain, and I often discovered new things partway through the testing, which invalidated prior measurements. A few examples of odd behaviors are:

LG webOS toggling Black Frame Insertion when you connect a different computer on the same port.
Using KDE Konsole to mark the start of a test run creates large wl_shm surfaces, which take a long time to copy over PCIe into GPU VRAM. The resulting spike in render time makes the compositor extra pessimistic about its timing predictions for a while.
Switching V-Sync modes in specific games not applying the change immediately.

Synthetic tests

As a quick validation run and an easy test of display settings, I built my own latency testing tool. It’s just a black square that goes white immediately when you click on it; perfect for the tool to react to. I added a configurable delay to simulate input processing. The test was performed on a clean Chromium profile with nothing except for out-of-the-box defaults.

How to read the charts

Each chart varies one parameter, shown on the Y axis. Latency is horizontal: represented as milliseconds on the bottom X axis, or in the number of frames at 120 Hz on the top. Charts may have multiple facets and those partition another variable - one facet per value. Within each facet, each Y value can have multiple horizontal boxplots - one per combination of other tunables. Same color across Y rows = only the Y parameter changes between the boxes. You can hover each box, or check the legend, to see which settings it corresponds to. Each bar is an IQR boxplot over many samples, with min-max whiskers and a vertical line representing the median.
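
For readers who want to reproduce the chart statistics from their own CSV logs, here is a minimal sketch of the five numbers behind each box, plus the milliseconds-to-frames conversion used for the top axis. The names BoxStats, quantile, summarize and msToFrames are made up for this illustration, and the linear-interpolation quantile is an assumption; the plotting code behind the charts may compute quartiles differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not taken from the plotting code behind the charts.
struct BoxStats {
    double min, q1, median, q3, max; // all in milliseconds
};

// Linearly interpolated quantile of an already sorted, non-empty sample set.
double quantile(const std::vector<double> &sorted, double q)
{
    assert(!sorted.empty() && q >= 0.0 && q <= 1.0);
    const double pos = q * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// Min-max whiskers, IQR box edges and the median line for one box.
BoxStats summarize(std::vector<double> samplesMs)
{
    assert(!samplesMs.empty());
    std::sort(samplesMs.begin(), samplesMs.end());
    return {samplesMs.front(), quantile(samplesMs, 0.25), quantile(samplesMs, 0.5),
            quantile(samplesMs, 0.75), samplesMs.back()};
}

// Convert a latency in milliseconds to a number of frames at a given refresh rate.
double msToFrames(double ms, double refreshHz = 120.0)
{
    return ms * refreshHz / 1000.0;
}
```

With this conversion, a 25 ms reading corresponds to 3 frames at 120 Hz.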

This looks exactly as expected - the medians and minimums are shifted roughly by the amount of delay. Except, why is my desktop computer slower than the laptop? They are running the same versions of software from a reused NixOS config, very similar hardware too. I expected them to match, or for the desktop to have a slight edge, if anything. To minimize the differences further, I created a brand new user account on the desktop and ran the test again:

There it was, something about my desktop profile was introducing at least 3 ms of latency! From here, I tried a bunch of things: plasma-manager to diff my existing profile against a clean one, removing all virtual desktops and disabling all KWin effects and any display scaling. While randomly closing apps, I found the culprit: the Zed editor. Apparently, an open Zed window can add latency to all my other apps even while idle in the background. Thankfully, this does not affect fullscreen games. I identified this issue after measuring everything else, so I’m glad this finding didn’t invalidate my existing in-game measurements. More on this later, in the KWin section.

LG display settings

Next up, I tested various settings on the TV.

Setting the input mode to PC (which locks out a bunch of picture settings) made no impact, while Black Frame Insertion seemed to add exactly one frame’s worth of delay. This one really hurts because I love using it. It seems like their implementation adds extra buffering, even though it could be done with a lagless rolling scan.

HDR had a tiny but measurable effect.

I planned on testing HDMI Auto Low Latency Mode - displays are supposed to apply their low-latency settings when the source requests it. When I was daily-driving a Radeon GPU on Windows, I remember the driver applied ALLM unconditionally in all fullscreen apps. I had to resort to faking EDIDs to stop it from seeing the mode as supported. Linux doesn’t seem to support ALLM, and I didn’t find an option on Nvidia’s Windows driver either.

Game tests

I hoped to find a game that supported all three major graphics APIs so I could compare between them. There are a handful of those, typically based on Unreal Engine, but they describe all but one of the APIs as experimental. I ended up with three games tested, each with a reproducible measurement setup. Comparing across games is pointless, since they’re all going to have different animation timings. Instead, the focus is on the different tunables available for each API.

Doom Eternal (Vulkan)

This one was easy to set up - just load a dark level (so any of them) with infinite ammo and observe the heavy cannon’s muzzle flash against a dark wall. The game uses Vulkan, so it doesn’t need a translation layer on Linux. I couldn’t get it to run directly on Wayland, despite this exact issue having been fixed last year.

Starting off, the only difference between platforms is the wider tail (at p75) on Linux:

If we don’t cap FPS below refresh rate, it starts buffering frames when V-Sync is enabled. That latency can be recovered by disabling it, as seen in the next chart. This still doesn’t produce frame tearing because the game is running through XWayland.

VRR by itself isn’t a significant factor:

Neither are Nvidia’s Windows-exclusive settings that I tested:

Borderlands 3 (DX11, DX12)

I modded my save game to remove the magazine attachment on a weapon so it doesn’t drain any ammo. This makes it ideal for looping 500 muzzle flash measurements back to back.

Windows had consistently lower latency, sometimes significantly so when V-Sync was used:

Going with native Proton Wayland (PROTON_ENABLE_WAYLAND=1, shown as proton_wayland in the charts) can claw back the latency in these cases:

DX12 is consistently slower across both operating systems. There might be some Unreal Engine hack that improves it, like the OneFrameThreadLag CVar, but I did not test any.

I tried various platform- and API-specific switches:

Nvidia’s Ultra Low Latency Mode on Windows
VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 and VKD3D_SWAPCHAIN_IMAGES=2 for DX12
DXVK_CONFIG="d3d9.maxFrameLatency=1;dxgi.maxFrameLatency=1" for DX11

The only one that had an impact was VKD3D_SWAPCHAIN_LATENCY_FRAMES=1, but even then, it was still significantly lagging DX11:

Capping FPS and not letting frames queue up at the refresh rate mark makes the biggest difference:

Hades 2 (DX12)

This game had a longer wind-up in the animation, but it was consistent. Measurements showed similar behavior as in prior tests. The following settings help, all things being equal, but are not necessarily cumulative:

Capping at/below refresh rate.
Using wine_wayland/PROTON_ENABLE_WAYLAND=1.
Setting VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 - though with wine_wayland at a fixed refresh rate, this capped frame rate at half of refresh.

Summary and recommendations

My takeaway is to prioritize wine_wayland, use late FPS limiting, VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 in DX12 games, and VRR if the game can’t hit a stable target or has bad frame pacing. I really hoped to push the V-Sync, non-VRR results lower, but I don’t see how to get there. It definitely seems like having XWayland in the mix breaks some signaling, which lets the swapchain buffer more when FPS is at refresh rate and V-Synced.

Gaming over the network

All of these experiments were done with Borderlands 3 (DX11); results here can be compared against local tests shown earlier. This was a direct 2.5GbE network between the two hosts, with an RTT of ~0.3 ms. Egress delay was added with # tc qdisc replace dev $DEVICE root netem delay ${DELAY}ms. Typical networks have symmetrical delay, but I wanted to confirm which specific direction impacts latency, so I tested asymmetric scenarios too.

First up, USB/IP is used to send input between the computers, but display output is captured directly from the host running the game:

The results look exactly as expected. With no injected network delay, it comes out exactly as fast as local results. USB/IP should be a good solution if you want to put your hardware somewhere in the basement. You could save a bunch of money on active USB cables or fancy docks this way?

Then, I tested routing input with Moonlight, but still capturing the direct video output of the host, not the encoded stream:

This confirms that Moonlight matches USB over IP in sending input over the network. There were no meaningful cross-platform differences here, either. Adding egress delay on the Sunshine host made no difference - Moonlight keeps sending fresh input without stalling for acknowledgement.

Next, I tested the typical round trip of click → Moonlight → Sunshine → Moonlight → display. This is where I encountered a recent regression in kernel 7.0, which manifested as the video stream never starting, but there was a simple workaround.

Finally, I compared across platforms by running Moonlight on Windows as well, keeping Sunshine on the Linux host in these scenarios:

Windows delivered a slightly more responsive experience overall. Some of the gap may be explained by the next section on KWin. But I can’t explain the impact of network delay - it looks like Windows was time travelling in this test. The long tail seen in experiments that used Moonlight only for relaying input is surprising as well.

KWin deep dive

Why is KWin slower than DWM on Windows? Why does one client’s window get penalized when another client queues frames to present ahead of time? It didn’t fit my mental model of how a compositor is supposed to work on a fixed refresh rate display. In my head, an event from the client either arrives on time to be presented in the upcoming frame, or it slips. But why would two small desktop apps not be able to get composited together in the same interval?

To understand this gap, I’ve added instrumentation to KWin and captured a frame in the middle of running my latency tool test.

Chromium’s input→present latency here is about 9.78 ms. KWin expected to complete the compositing in about 3 ms, so it scheduled it to start later in the frame. This delay, which in the example is about 2.98 ms, serves a good purpose - other windows & clients could report damaged surfaces in that time and get redrawn in this frame. Still, the work was overestimated by about 2.39 ms, and another 1.34 ms was budgeted for the safety margin. If we were able to remove all this slack, we’d expect a lower bound of 3.07 ms from when Chromium receives input until it lands in a pageflip. Remember this number for later.
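
The lower bound follows from subtracting the three slack terms from the measured latency. The sketch below only restates that arithmetic with the figures quoted above; the FrameBudget struct and lowerBound function are illustrative names, not part of KWin or of the instrumentation patch.

```cpp
#include <cassert>
#include <chrono>

using Micros = std::chrono::microseconds;

// Figures for one captured frame, as quoted in the text.
struct FrameBudget {
    Micros inputToPresent; // measured input -> pageflip latency
    Micros scheduledDelay; // how long compositing was deferred
    Micros overestimate;   // predicted minus actual compositing work
    Micros safetyMargin;   // margin budgeted ahead of the pageflip
};

// Latency that would remain if every slack term were removed.
constexpr Micros lowerBound(const FrameBudget &b)
{
    return b.inputToPresent - b.scheduledDelay - b.overestimate - b.safetyMargin;
}

// 9.78 ms - 2.98 ms - 2.39 ms - 1.34 ms = 3.07 ms
static_assert(lowerBound({Micros{9780}, Micros{2980}, Micros{2390}, Micros{1340}}) == Micros{3070},
              "slack terms should add up to the quoted lower bound");
```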

And now, let’s analyze the case when KWin is busy with one client while I do the latency test in another window. This is the same case from earlier in this post, when I discovered that an idle Zed editor app in the background can impact other clients’ latency. This generally happens with all windows that present every frame, like when they use a Wayland frame callback. They produce 120 FPS on my 120 Hz screen and keep KWin fully occupied. First off, I found a good proxy metric to identify these clients:

# bpftrace -e '
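// 0xc03864bc is DRM_IOCTL_MODE_ATOMIC on x86-64 and arm64.
// The first field of struct drm_mode_atomic is the flags word; 0x100 is DRM_MODE_ATOMIC_TEST_ONLY.
tracepoint:syscalls:sys_enter_ioctl
/args->cmd == 0xc03864bc/
{
  $flags = *uptr((uint32 *)args->arg);
  @mode_atomic[$flags & 0x100 ? "test" : "commit"] = count();
}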

interval:s:1
{
  print(@mode_atomic);
  clear(@mode_atomic);
}
'

When there’s at least one such window, KWin’s rate of atomic commit calls matches the display’s refresh rate. Minimizing or moving the windows onto other virtual desktops is as good as closing them - the rate of these commits drops to an idle level. Interestingly, if you move your mouse around from an idle state, this rate spikes way higher too. I couldn’t obtain a reproducible click-to-photon measurement while moving the mouse around, but it would be worth following up.
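
The two magic numbers in the trace script can be sanity-checked without kernel headers. The sketch below decodes an ioctl request number using the generic Linux layout found on x86-64 and arm64 (some architectures use a different split, which is an assumption to keep in mind). IoctlFields, decodeIoctl, kAtomicTestOnly and isTestOnly are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Generic Linux ioctl number layout (x86-64, arm64):
// bits 0-7 command number, 8-15 type character, 16-29 argument size, 30-31 direction.
struct IoctlFields {
    std::uint32_t nr, type, size, dir;
};

constexpr IoctlFields decodeIoctl(std::uint32_t cmd)
{
    return {cmd & 0xffu, (cmd >> 8) & 0xffu, (cmd >> 16) & 0x3fffu, (cmd >> 30) & 0x3u};
}

// Value of DRM_MODE_ATOMIC_TEST_ONLY in the kernel's drm_mode.h UAPI header.
constexpr std::uint32_t kAtomicTestOnly = 0x0100;

// True for atomic commits that only validate a configuration without applying it.
constexpr bool isTestOnly(std::uint32_t flags)
{
    return (flags & kAtomicTestOnly) != 0;
}
```

Decoding 0xc03864bc gives command 0xbc, type 'd' (the DRM ioctl family), a 56-byte argument (the size of struct drm_mode_atomic) and a read/write direction, which matches DRM_IOCTL_MODE_ATOMIC.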

Chromium’s input→present latency here is about 14.74 ms. The client received input 6.4 ms before the pageflip, which should be plenty of time, but it missed it! It will only be shown in the next frame, a full 8.3 ms later - hence the ~14.7 ms total above.
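
A deliberately simplified model of this cliff, not KWin’s actual scheduler: an update makes the upcoming pageflip only if it arrives before compositing starts, otherwise it waits one extra refresh interval. The function name and the compositeLead parameter (how far ahead of the flip compositing begins) are assumptions made for illustration; the capture above only tells us that the lead was larger than 6.4 ms in that frame.

```cpp
#include <cassert>
#include <chrono>

using Micros = std::chrono::microseconds;

// beforeFlip:      how long before the next pageflip the client update arrived
// compositeLead:   how long before the pageflip compositing had already started (hypothetical input)
// refreshInterval: duration of one refresh cycle, about 8333 us at 120 Hz
constexpr Micros inputToPresent(Micros beforeFlip, Micros compositeLead, Micros refreshInterval)
{
    // Arrived in time: shown on the upcoming flip. Too late: slips by one full interval.
    return beforeFlip >= compositeLead ? beforeFlip : beforeFlip + refreshInterval;
}
```

With an update arriving 6.4 ms before the flip and compositing already underway, the model yields 6.4 ms + 8.33 ms, in line with the roughly 14.7 ms measured. With a short lead such as 3 ms, the same update would be presented after 6.4 ms.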

This is the big reason for the latency cliff I observed at the start of this investigation. When a busy client keeps KWin occupied and submits frames ahead of time, the window for other clients shrinks and they often miss the upcoming pageflip. This isn’t an architectural problem with KWin: as we’ve seen, it tries to leave room for other clients to join in before compositing starts. It is, however, very pessimistic about hardware capabilities and safety margins. Let’s investigate the sources of slack in this system and check how much they can be shrunk safely.

Finding sources of slack

Using KWin’s built-in KWIN_LOG_PERFORMANCE_DATA=1 profiler, we can do some quick analysis because the variables we’re interested in are already reported by it. I’m using Miller here; you should definitely check it out and make it part of your toolbox.

$ mlr --c2m \
  rename -g -r ' ,_' then \
  rename "predicted_render_time,prt" then \
  put '
    $rt = $render_end - $render_start;
    $delta = $prt - $rt;
  ' then \
  stats1 -f prt,rt,delta -a p50,p75,p95 then \
  put 'for (k, v in $*) { $[k] = fmtnum(v / 1e6, "%.2f") . " ms" }' \
  "$HOME/kwin perf statistics HDMI-A-1.csv"

prt_p50	prt_p75	prt_p95	rt_p50	rt_p75	rt_p95	delta_p50	delta_p75	delta_p95
11.35 ms	13.64 ms	18.25 ms	2.06 ms	2.08 ms	2.47 ms	9.23 ms	12.05 ms	16.15 ms

The delta shown here is massive! KWin consistently overestimates how long compositing is going to take and it goes directly toward added latency. When the prediction spills over to the next frame, KWin schedules the composite render for the next one, effectively giving up on the current pageflip. I traced it to the following code:

// Estimate when it's a good time to perform the next compositing cycle.
// the 1ms on top of the safety margin is required for timer and scheduler inaccuracies
std::chrono::nanoseconds expectedCompositingTime = std::min(renderJournal.result() + safetyMargin + 1ms, 2 * vblankInterval);

It has three components: the RenderJournal, which is an adaptive estimator, a safety margin that we’ll look at soon, and a fixed 1 ms term for “scheduler inaccuracies”. The predicted render time from the performance data log is technically just the renderJournal.result() term. But KWin uses expectedCompositingTime to schedule the compositing wakeup timer. That makes the gap between the scheduling estimate and actual rendering time even bigger than what we’ve analyzed above.
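
To get a feel for the numbers, the quoted expression can be evaluated in isolation. This sketch mirrors the formula with free-standing parameters; the function names and the spillsOver helper are illustrative, and plugging percentiles from separate columns into one formula is only a rough approximation of what happens per frame.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;
using std::chrono::nanoseconds;

// Same shape as the quoted KWin expression: prediction + safety margin + 1 ms,
// clamped to two refresh intervals.
constexpr nanoseconds expectedCompositingTime(nanoseconds predictedRender, nanoseconds safetyMargin,
                                              nanoseconds vblankInterval)
{
    return std::min<nanoseconds>(predictedRender + safetyMargin + nanoseconds(1ms),
                                 2 * vblankInterval);
}

// True when the estimate no longer fits into a single refresh interval, which is the
// case where the text says the current pageflip is given up.
constexpr bool spillsOver(nanoseconds expected, nanoseconds vblankInterval)
{
    return expected > vblankInterval;
}
```

At 120 Hz (8.33 ms per frame), the p50 prediction of 11.35 ms plus the p50 safety margin of 1.46 ms and the fixed 1 ms gives 13.81 ms, which spills over. Using the measured p50 render time of 2.06 ms instead gives 4.52 ms, which fits comfortably.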

I found it hard to believe we couldn’t get a wakeup timer more granular than 1 ms, so I looked at what KWin was using. Indeed, it was passing the sleep duration in milliseconds to a QBasicTimer. I replaced it with a QChronoTimer, which operates on nanosecond-precision durations, dropped the 1 ms term, but the change didn’t have the effect I was expecting. When I traced the underlying Qt code, I finally understood what that KWin comment was referring to. On Unix platforms, Qt’s timer code rounds every duration up to the next millisecond:

if (timeToWait > 0ns)
    return roundToMillisecond(timeToWait);

This is documented on the Qt::TimerType enum, and it means there’s no built-in way to get sub-millisecond precision from a timer that runs on the Qt event loop.

I rolled out my own timer based on timerfd and used QSocketNotifier to integrate it with the existing event loop. The wakeup deviation from the target measured at 51 µs at p99 and that’s what I replaced the old constant term with. That’s much more precise than a full millisecond!
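
The following is a minimal, Qt-free sketch of the timerfd idea, not the patch itself. It arms a one-shot absolute timer and blocks on read() to measure how late the wakeup arrives; in an event loop the file descriptor would be watched by a notifier such as QSocketNotifier instead of a blocking read. The function names are made up for this example, and the measured deviation will vary with the kernel, scheduler policy and system load.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>
#include <sys/timerfd.h>
#include <unistd.h>

using std::chrono::nanoseconds;

// Current CLOCK_MONOTONIC time as a duration since the clock's epoch.
static nanoseconds monotonicNow()
{
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return std::chrono::seconds(ts.tv_sec) + nanoseconds(ts.tv_nsec);
}

// Arms a one-shot absolute timerfd `delay` from now, blocks until it fires and returns
// how late the wakeup was. Returns a negative duration if any syscall fails.
nanoseconds measureWakeupDeviation(nanoseconds delay)
{
    assert(delay > nanoseconds::zero());
    const int fd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
    if (fd < 0)
        return nanoseconds(-1);

    const nanoseconds target = monotonicNow() + delay;
    itimerspec spec{}; // it_interval stays zero, so the timer fires once
    spec.it_value.tv_sec = target.count() / 1'000'000'000;
    spec.it_value.tv_nsec = target.count() % 1'000'000'000;

    nanoseconds result(-1);
    std::uint64_t expirations = 0;
    if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &spec, nullptr) == 0
        && read(fd, &expirations, sizeof(expirations)) == static_cast<ssize_t>(sizeof(expirations))) {
        result = monotonicNow() - target;
    }
    close(fd);
    return result;
}
```

Calling measureWakeupDeviation in a loop and taking a high percentile of the results is one way to arrive at a figure comparable to the 51 µs quoted above, though the exact methodology used for that number is not described here.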

I think large constants would be even more problematic at 360 Hz and 540 Hz refresh rates, where a millisecond is 36% or 54% of the whole frame, respectively. I don’t have one of these displays to test, and I’m not sure how perceivable that would be, either.

This timer fix only affects the render loop thread. The DRM commit thread runs on SCHED_RR and uses a synchronous syscall to sleep, so there is no timer slack to optimize there.

Looking at the safety margin next:

$ mlr --c2m \
  rename -g -r ' ,_' then \
  stats1 -f safety_margin -a p50,p75,p95 then \
  put 'for (k, v in $*) { $[k] = fmtnum(v / 1e6, "%.2f") . " ms" }' \
  "$HOME/kwin perf statistics HDMI-A-1.csv"

safety_margin_p50	safety_margin_p75	safety_margin_p95
1.46 ms	1.49 ms	1.83 ms

It’s defined here:

// TODO reduce the default for this, once we have a more accurate way to know when an atomic commit
// is actually applied. Waiting for the commit returning seems to work on Intel and AMD, but not with NVidia
static const std::chrono::microseconds s_safetyMarginMinimum{environmentVariableIntValue("KWIN_DRM_OVERRIDE_SAFETY_MARGIN").value_or(1000)};

void DrmCommitThread::setModeInfo(uint32_t maximum, std::chrono::nanoseconds vblankTime)
{
    std::unique_lock lock(m_mutex);
    m_minVblankInterval = std::chrono::nanoseconds(1'000'000'000'000ull / maximum);
    // the kernel rejects commits that happen during vblank
    // the 1.5ms on top of that was chosen experimentally, for the time it takes to commit + scheduling inaccuracies
    m_baseSafetyMargin = vblankTime + s_safetyMarginMinimum;
    m_safetyMargin = m_baseSafetyMargin + m_additionalSafetyMargin;
}

I don’t know the full history here, but at least on the latest Nvidia drivers & kernel, I was able to go as low as -150 (the override is in microseconds) without issues. Yes, negative numbers are allowed and they “eat” into the margin contributed from the other terms. The value I chose still leaves about 0.3 ms of an effective margin in my case, on a 120 Hz screen. m_additionalSafetyMargin itself is adaptive and could, in theory, cover for overly aggressive tuning. In practice, though, the feedback mechanism seems limited, and I haven’t been able to observe it compensating that way.
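
The effect of the override can be worked through with the base margin formula from the snippet. The vblank time of roughly 460 µs used below is an inference, not a logged value: it is what remains of the 1.46 ms p50 safety margin after subtracting the 1000 µs default, assuming the additional adaptive margin was zero.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::microseconds;

// Mirrors m_baseSafetyMargin = vblankTime + s_safetyMarginMinimum.
// `overrideUs` stands in for KWIN_DRM_OVERRIDE_SAFETY_MARGIN; 1000 is the default.
constexpr microseconds baseSafetyMargin(microseconds vblankTime, long overrideUs = 1000)
{
    return vblankTime + microseconds(overrideUs);
}

// Default: 460 us + 1000 us = 1460 us. With an override of -150: 460 us - 150 us = 310 us.
static_assert(baseSafetyMargin(microseconds(460)) == microseconds(1460), "default margin");
static_assert(baseSafetyMargin(microseconds(460), -150) == microseconds(310), "overridden margin");
```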

Finally, let’s investigate why the predicted render time is so high relative to how long rendering actually takes. The measurement of the GPU compositing workload is raised to a floor of 2 ms.

// timings are pretty unpredictable in the sub-millisecond range; this minimum
// ensures that when CPU or GPU power states change, we don't drop any frames
const std::chrono::nanoseconds minimumTime = std::chrono::milliseconds(2);
const auto end = std::max({m_cpuProbe.start + (m_gpuProbe.end - m_gpuProbe.start), m_cpuProbe.end, m_cpuProbe.start + minimumTime});

I could not reproduce the concern described in that comment on my machine: tweaking CPU governors and locking clocks with nvidia-smi --lock-gpu-clocks didn’t make a big difference. The floor of 2 ms is high, and it’s another fixed value that wouldn’t scale well to 540 Hz. On a 4090, at idle power states (P8), I measured the GPU render time at 0.36 ms p50 and 1.01 ms p95.
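
Expressed relative to the start of the CPU probe, the quoted expression reduces to the maximum of the GPU duration, the CPU duration and the floor. The sketch below shows how the floor hides the measured GPU times; the function name is illustrative and the 200 µs CPU time is a made-up placeholder.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::microseconds;

// Simplified form of the quoted std::max expression, with all times taken
// relative to the start of the CPU probe.
constexpr microseconds recordedRenderTime(microseconds cpuTime, microseconds gpuTime,
                                          microseconds minimum = microseconds(2000))
{
    return std::max({gpuTime, cpuTime, minimum});
}
```

With the 2 ms floor, both the 0.36 ms p50 and the 1.01 ms p95 GPU times are recorded as 2 ms. Passing a floor of zero would record them as measured.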

Instead of a hard floor, I patched KWin to keep a ring buffer of measured compositing durations from the last 512 frames. Then, instead of letting RenderJournal decay slowly down to the 2 ms floor, I made it decay faster toward the p95 frame time from that window. I don’t like this solution at all, but it got the job done for now.
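
A rough sketch of such a window, written from the description above rather than from the actual patch: a fixed 512-entry ring buffer with a nearest-rank percentile. The class name, its interface and the percentile definition are all assumptions, and the part where the estimator decays toward the returned value is left out.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using std::chrono::nanoseconds;

// Fixed-size window over the most recent compositing durations.
class RenderTimeWindow
{
public:
    // Stores a sample, overwriting the oldest one once the window is full.
    void push(nanoseconds duration)
    {
        m_samples[m_next] = duration;
        m_next = (m_next + 1) % m_samples.size();
        m_count = std::min(m_count + 1, m_samples.size());
    }

    // Nearest-rank percentile (1..100) over the samples currently held.
    nanoseconds percentile(unsigned percent) const
    {
        assert(m_count > 0 && percent >= 1 && percent <= 100);
        std::vector<nanoseconds> sorted(m_samples.begin(), m_samples.begin() + m_count);
        const std::size_t rank = (percent * m_count + 99) / 100; // ceil(percent * n / 100)
        std::nth_element(sorted.begin(), sorted.begin() + (rank - 1), sorted.end());
        return sorted[rank - 1];
    }

private:
    std::array<nanoseconds, 512> m_samples{};
    std::size_t m_next = 0;
    std::size_t m_count = 0;
};
```

Copying and partially sorting 512 entries once per frame is cheap next to the compositing work itself, which is why this sketch does not bother with an incremental percentile structure.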

Measuring patched KWin

Unfortunately, my Teensy microcontroller failed at this point and I couldn’t resume the same LDAT measurements that I’d done until then. Thankfully, I kept an archive of recorded sessions with both an instrumented KWin and LDAT being captured. I confirmed that the input→present measurements correlate well with LDAT click-to-photon readings - R² >= 94% in every run. Because I switched metrics, the charts below should not be compared to the ones presented earlier.
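
For anyone wanting to run the same check on paired samples, one common way to compute R² is as the squared Pearson correlation, which equals the coefficient of determination of a simple linear fit. This is a sketch under that assumption; how the figure above was actually computed is not stated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared Pearson correlation between two equally sized series.
// Both series need at least two samples and must not be constant.
double rSquared(const std::vector<double> &x, const std::vector<double> &y)
{
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());

    double sumX = 0.0, sumY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumX += x[i];
        sumY += y[i];
    }
    const double meanX = sumX / n, meanY = sumY / n;

    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX, dy = y[i] - meanY;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    assert(sxx > 0.0 && syy > 0.0);
    return (sxy * sxy) / (sxx * syy);
}
```

Here x would hold the instrumented input→present values and y the LDAT readings for the same clicks; a constant offset between the two (such as display processing time) does not lower the result.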

Final comparison:

Without eglgears running, the minimum observed input→present latency reaches the 3 ms range or slightly below, the same figure I called out as a target at the start of the KWin section. Fairness was also restored, and running eglgears during the test no longer impacts the observed latency meaningfully.

The gains in windowed games & apps should be substantial. V-Sync fullscreen games (with direct scanout) should see a benefit of a millisecond or so. Games using VRR, or fullscreen games where tearing is allowed, will generally not see reduced latency from these changes as long as they stay at or below the refresh rate.

These improvements go some of the way to closing the gap between Linux and Windows. There’s about 1.1-1.2 ms gained in the minimums, while the gap between platforms in their best measurements with VRR was somewhere around 4 ms.

Future work

I plan on submitting my changes upstream in the coming weeks. I intend to daily-drive this build for a while and measure the number of late/slipped frames along the way, to make sure the latency reduction doesn’t degrade smoothness. One way to measure frames slipping should be this:

$ mlr --c2m \
  rename -g -r ' ,_' then \
  put '
    $slip = $pageflip_timestamp - $target_pageflip_timestamp;
    $slip = int(roundm($slip / $refresh_duration, 1.0));
  ' then \
  count-distinct -f slip \
  "$HOME/kwin perf statistics HDMI-A-1.csv"

There are a few things I’m looking out for in this space:

FIFO_LATEST_READY_EXT - will this avoid the undesirable queuing behavior when FPS caps out at the refresh rate?
Making wl_shm fast - should help programs like Konsole not blow up KWin’s predicted render time
commit-timing in KWin
A frame limiter that can use the above with VK_EXT_present_timing for well-paced integer divisors of the refresh rate
- This will be very useful for mangochill and anyone who can’t use gamescope (like Nvidia users)
Compositing windows faster with GL_NV_fill_rectangle - I wonder if it’d have a measurable advantage for a stack of small windows.
When KWin moves to Vulkan, could it submit the GPU commands for compositing ahead of time, but then update the latest client surfaces very late? This should be possible with descriptor indexing.

I want to test these when I repair/replace my LDAT microcontroller: