Deep Dive

Stop ignoring visual tests because they fail for no reason

Flaky visual tests train teams to ignore results. Learn why screenshot comparisons fail randomly and how to build a visual testing workflow you can actually trust.

What visual testing flakiness actually is

Flakiness in visual testing means tests that fail without meaningful code changes. The screenshot looks different, but nothing important changed. These false positives are the primary reason teams abandon visual testing.

The pattern is predictable: a visual test starts failing, someone investigates, finds nothing wrong, and approves the new baseline. After this happens enough times, the team stops investigating—they just approve everything or disable the tests entirely.

Why teams end up disabling visual tests

Noisy tests aren't just annoying—they're actively harmful. When tests regularly fail for no reason, teams develop reasonable responses: skip them, auto-approve changes, or remove them from CI entirely.

This isn't a discipline failure. It's rational behavior in response to poor signal-to-noise ratio. The solution isn't to demand more rigor from reviewers—it's to eliminate the noise.

Common sources of flakiness

Font rendering differences

Different operating systems and browsers render fonts differently. Even the same browser on different machines can produce sub-pixel variations.

Animation and transition timing

Screenshots captured mid-animation produce inconsistent results. Spinners, skeleton loaders, and CSS transitions are common culprits.

Dynamic content

Timestamps, relative dates, random avatars, and live data change between test runs, creating meaningless diffs.

Rendering timing

Images loading, web fonts loading, or components hydrating can cause screenshots to capture incomplete states.

Environment differences

CI runners have different screen sizes, GPU capabilities, and system fonts than local development machines.

Third-party content

Ads, embedded widgets, and external images change independently of your code and create noise in visual diffs.

Notice that none of these are bugs in your application. They're all environmental or timing issues that create legitimate pixel differences without representing meaningful visual regressions.

Strategies for stabilizing visual tests

Control your rendering environment

Use containerized browsers with fixed viewport sizes, system fonts, and GPU settings. Docker-based CI pipelines help ensure consistency.
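
As a concrete sketch, assuming Playwright Test (the same idea applies to other tools), you can pin the viewport and device scale factor in the project config and run the suite inside a pinned browser container image so CI and local machines share the same rendering stack:

```ts
// playwright.config.ts — a minimal sketch assuming Playwright Test.
// Run the suite inside a pinned browser image (e.g. mcr.microsoft.com/playwright)
// so every run uses the same browser build, fonts, and GPU settings.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      name: 'chromium',
      use: {
        ...devices['Desktop Chrome'],
        viewport: { width: 1280, height: 720 }, // same viewport on every run
        deviceScaleFactor: 1,                   // avoid retina vs. non-retina pixel diffs
      },
    },
  ],
});
```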

Wait for stability

Capture screenshots only after fonts load, animations complete, and network requests settle. Explicit wait conditions beat arbitrary timeouts.
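
A sketch of what those waits can look like, again assuming Playwright Test; the route and heading name are placeholders:

```ts
import { test, expect } from '@playwright/test';

test('settings page is visually stable', async ({ page }) => {
  await page.goto('/settings');                // placeholder route
  await page.waitForLoadState('networkidle');  // let in-flight requests settle
  await page.evaluate(async () => { await document.fonts.ready; }); // web fonts finished loading
  await expect(page.getByRole('heading', { name: 'Settings' })).toBeVisible(); // content rendered

  await expect(page).toHaveScreenshot('settings.png');
});
```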

Mock dynamic content

Replace timestamps with fixed values, seed random generators, and use deterministic test data to eliminate content-driven flakiness.
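
For example, with Playwright Test you can freeze the clock and serve a fixed fixture in place of live data; the API route, fixture path, and page URL below are made up for illustration:

```ts
import { test, expect } from '@playwright/test';

test('dashboard renders deterministic data', async ({ page }) => {
  // Freeze the clock so relative dates ("3 minutes ago") never drift between runs.
  await page.clock.setFixedTime(new Date('2024-01-15T10:00:00Z'));

  // Serve a checked-in fixture instead of live API data.
  await page.route('**/api/activity', (route) =>
    route.fulfill({ path: 'tests/fixtures/activity.json' })
  );

  await page.goto('/dashboard'); // placeholder route
  await expect(page).toHaveScreenshot('dashboard.png');
});
```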

Test at the right granularity

Component-level snapshots are often more stable than full-page captures. Isolate what you're testing from unrelated visual noise.

The common thread is control. You need to control the rendering environment, control timing, and control the content being rendered. Without that control, pixel comparisons will always be unreliable.

Test granularity matters

Full-page screenshots capture everything—including things you don't care about. A header component update shouldn't fail every page test in your suite.

Component-level visual testing isolates what you're actually trying to protect. A button component test fails when the button changes, not when some unrelated page element shifts. This reduces noise and makes failures easier to diagnose.
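
A sketch of a component-scoped capture, assuming Playwright Test and a Storybook-style preview page; the story URL and selector are placeholders:

```ts
import { test, expect } from '@playwright/test';

test('primary button appearance', async ({ page }) => {
  await page.goto('/iframe.html?id=button--primary'); // placeholder story URL
  const button = page.locator('.primary-button');     // placeholder selector

  // Capture only the component, so unrelated page changes can't fail this test.
  await expect(button).toHaveScreenshot('primary-button.png');
});
```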

For an overview of visual testing approaches and when to use them, see the visual regression testing guide.

Flakiness is a workflow problem

It's tempting to view flaky tests as a technical problem—better tooling, smarter diffing algorithms, machine learning to ignore irrelevant changes. These help at the margins, but they don't address root causes.

The real issue is workflow. Who decides what gets tested? Who reviews visual changes? How quickly do failures get triaged? Teams with stable visual tests invest in process, not just tooling.

Part of that process is involving the right people. Designer-approved visual testing helps by ensuring visual changes get reviewed by people with the context to judge them.


Frequently Asked Questions

Why are my visual tests flaky?
Visual test flakiness usually comes from rendering inconsistencies: font smoothing differences, animation timing, dynamic content, image loading races, or environmental differences between CI and local machines. The tests aren't wrong—they're detecting real pixel differences that don't represent meaningful changes.
Why do visual tests fail on CI but pass locally?
CI environments differ from local machines in GPU rendering, installed fonts, screen resolution, and browser versions. These differences create legitimate pixel variations that visual tests detect. The solution is either matching environments exactly or adjusting comparison thresholds.
Should I increase the diff threshold to reduce failures?
Threshold increases are a band-aid. They hide flakiness, but they also hide real regressions. It's better to address root causes: stabilize rendering, mock dynamic content, and test at appropriate granularity. Use thresholds sparingly and understand what you're trading off.
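If you do reach for a threshold, scope it to the one capture that needs it rather than raising a global setting. In Playwright Test, for instance, that might look like the following sketch (the route and screenshot name are placeholders):
```ts
import { test, expect } from '@playwright/test';

test('revenue chart', async ({ page }) => {
  await page.goto('/reports'); // placeholder route
  // Tolerate up to 1% of differing pixels on this known-noisy capture only.
  await expect(page).toHaveScreenshot('revenue-chart.png', { maxDiffPixelRatio: 0.01 });
});
```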
How do I handle animations in visual tests?
Either disable animations during test runs (via CSS or test configuration) or wait for animations to complete before capturing screenshots. Capturing mid-animation will always be inconsistent.
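In Playwright Test, for example, you can inject the CSS yourself or pass the built-in option; the route below is a placeholder:
```ts
import { test, expect } from '@playwright/test';

test('modal without animation noise', async ({ page }) => {
  await page.goto('/modal-demo'); // placeholder route

  // One option: globally disable CSS animations and transitions before capturing.
  await page.addStyleTag({
    content: '*, *::before, *::after { animation: none !important; transition: none !important; }',
  });

  // Another: let the assertion freeze animations for you.
  await expect(page).toHaveScreenshot('modal.png', { animations: 'disabled' });
});
```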
Why does font rendering cause visual test failures?
Different operating systems use different font rendering engines with different anti-aliasing algorithms. macOS, Windows, and Linux all render the same font file differently. Even different browser versions on the same OS can vary. Consistent CI environments and web fonts help reduce this.
Is visual testing flakiness a discipline problem?
It's tempting to blame team discipline, but flakiness is usually a workflow and infrastructure problem. Teams don't disable tests because they're lazy—they disable them because the signal-to-noise ratio is too low to be useful. Fixing the infrastructure is more effective than demanding more discipline.
How many false positives are acceptable?
Ideally zero. Every false positive erodes trust and trains the team to ignore results. If you're seeing regular false positives, address the root cause rather than accepting them as normal.
Can visual testing work reliably in CI?
Yes, but it requires intentional setup. Containerized browsers, deterministic test data, explicit stability waits, and appropriate test granularity can produce reliable visual tests. The teams that succeed invest in infrastructure stability, not just test coverage.

We're exploring a quieter approach to visual testing—join the waitlist
