Tactics · 12 min read · May 10, 2026

A/B Testing for AI Search: The Complete Guide

AI engines are probabilistic systems — they don't return the same answer every time. Most brands make AEO changes and declare victory or defeat based on a single observation. Here's how to measure correctly.

Why A/B Testing for AI Search is Hard

Traditional A/B testing assumes a deterministic system: show variant A to half your visitors, variant B to the other half, measure conversion rates. The system behaves the same for every user.

AI engines are different. Run the same query twice and you may get different answers. The model's output is influenced by temperature parameters, context from the conversation, the user's location and history, and real-time retrieval variability. A single query run tells you almost nothing about whether your brand is consistently cited.

The single-run fallacy

Querying ChatGPT once and seeing your brand cited does not mean your brand is consistently cited. It means you were cited that one time. Citation rate requires multiple samples — the more, the more statistically reliable your measurement.
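
To make the fallacy concrete, here is a small simulation sketch. It assumes each query run is an independent draw with a fixed underlying citation probability, which is a simplification since real runs may be correlated. The names `simulate_runs`, `observed_rate`, and the 30% rate are illustrative and not part of any tool.

```python
import random


def simulate_runs(true_rate: float, n_runs: int, rng: random.Random) -> list[bool]:
    """Simulate n_runs query runs; True means the brand was cited in that run."""
    return [rng.random() < true_rate for _ in range(n_runs)]


def observed_rate(runs: list[bool]) -> float:
    """Fraction of runs in which the brand was cited."""
    return sum(runs) / len(runs)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sketch is repeatable
    true_rate = 0.30         # hypothetical underlying citation rate

    # A single run can only ever report 0% or 100%.
    print("1 run:", observed_rate(simulate_runs(true_rate, 1, rng)))

    # Larger samples settle closer to the underlying rate.
    for n in (10, 100, 1000):
        print(n, "runs:", round(observed_rate(simulate_runs(true_rate, n, rng)), 3))
```

A single run reports either 0.0 or 1.0, whatever the underlying rate is; only the larger samples carry usable information about it.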

The Statistical Foundation

AI citation tracking uses the same statistical principles as conversion rate optimisation. Your citation rate is a proportion — the fraction of sampled queries in which your brand appears. To get a reliable estimate of that proportion, you need a sufficient sample size.

Wilson Confidence Intervals

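The Wilson score interval is a standard method for estimating a proportion from a small sample, and it behaves better than the simple normal approximation when the rate is close to 0% or 100%. GetCited uses Wilson 95% confidence intervals for all citation rate measurements. A 95% interval is constructed so that, across repeated sampling, about 95% of such intervals contain the true citation rate, so read the reported rate together with its interval and not as a single exact number.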
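
As a minimal sketch of the calculation, the function below implements the textbook Wilson score formula. The name `wilson_interval` and its parameters are illustrative; this is not GetCited's code, only the standard formula the method is named after.

```python
import math


def wilson_interval(cited: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= cited <= total:
        raise ValueError("cited must be between 0 and total")

    p_hat = cited / total          # observed citation rate
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p_hat + z2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total + z2 / (4 * total * total)
    )
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(cited=12, total=40)  # hypothetical counts
    print(f"observed 30.0%, 95% interval {low:.1%} to {high:.1%}")
```

For example, 12 citations in 40 runs is an observed rate of 30%, but the interval runs from roughly 18% to 45%, which shows how loose a 40-sample estimate still is.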

Required Sample Size

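For a citation rate around 50%, you need approximately 100 samples to get a confidence interval of ±10 percentage points at 95% confidence. Because the variance of a proportion, p(1 − p), peaks at 50%, lower citation rates (10–20%) need fewer samples for the same absolute margin. They need more samples for the same relative precision: ±10 percentage points on a 20% rate is still half of the rate itself.

Target precision | Sample size needed (50% citation rate) | Sample size needed (20% citation rate)
±15 percentage points | 45 samples | 30 samples
±10 percentage points | 100 samples | 65 samples
±5 percentage points | 385 samples | 250 samples

Figures use the normal approximation at 95% confidence and are rounded up to convenient planning numbers.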
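
Sample sizes of this kind can be approximated with the normal-approximation formula n ≈ z² · p(1 − p) / E², where p is the expected citation rate and E is the target margin. The sketch below assumes that approximation, which is not necessarily the exact method behind any published figures; `samples_for_margin` is an illustrative name. It returns unrounded minimums, so planning tables typically round its results up for headroom.

```python
import math


def samples_for_margin(rate: float, margin: float, z: float = 1.96) -> int:
    """Smallest n such that z * sqrt(rate * (1 - rate) / n) <= margin."""
    if not 0 < rate < 1:
        raise ValueError("rate must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    return math.ceil(z * z * rate * (1 - rate) / (margin * margin))


if __name__ == "__main__":
    for margin in (0.15, 0.10, 0.05):
        for rate in (0.50, 0.20):
            n = samples_for_margin(rate, margin)
            print(f"rate {rate:.0%}, margin ±{margin:.0%}: {n} samples")
```

For a 50% rate and a ±10 point margin this returns 97, in line with the "approximately 100" figure above.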

In practice, GetCited's Multi-Query Sampling methodology runs each tracked query multiple times per day, accumulating samples rapidly enough to achieve ±10 percentage point precision within 7–14 days of monitoring.

How to Design an AEO A/B Test

1. Establish a Baseline

Before making any changes, monitor your target queries for at least 7 days. Record your citation rate across all engines for the queries you want to improve. This is your pre-treatment baseline.

2. Isolate One Variable

Make exactly one change per test. If you change the page schema, the content structure, and the title simultaneously, you cannot attribute any citation rate change to a specific cause. One change at a time.

3. Set Your Monitoring Period

Run post-change monitoring for the same number of days as your baseline period. For most citation rate changes, 7–14 days of daily monitoring is sufficient to detect a 10+ percentage point shift.
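
Whether 7–14 days is enough depends on how many samples accumulate per day. The sketch below estimates the samples needed in each period to detect a given shift, using the usual normal-approximation formula for comparing two proportions. It assumes a two-sided 5% significance level and 80% power; the power target is an assumption added here, not a figure from this guide, and `samples_per_period` is an illustrative name.

```python
import math
from statistics import NormalDist


def samples_per_period(
    p_before: float, p_after: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate samples needed in each period to detect p_before -> p_after."""
    if p_before == p_after:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_before + p_after) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_before * (1 - p_before) + p_after * (1 - p_after))
    ) ** 2
    return math.ceil(numerator / (p_before - p_after) ** 2)


if __name__ == "__main__":
    n = samples_per_period(0.20, 0.30)  # hypothetical 10-point shift
    runs_per_day = 30                   # hypothetical sampling rate
    print(n, "samples per period, about", math.ceil(n / runs_per_day), "days")
```

Under these assumptions, detecting a rise from 20% to 30% takes roughly 300 samples in each period, so the monitoring window should be set from the daily sample count and not from the calendar alone.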

4. Evaluate with Statistical Significance

Compare pre-change and post-change citation rates. Use a two-proportion z-test or equivalent to determine whether the change is statistically significant at your chosen confidence level (95% is standard).

  • p-value < 0.05: the change is statistically significant at the 5% level (95% confidence)
  • p-value 0.05–0.10: marginal significance; extend your monitoring period
  • p-value > 0.10: no statistically significant effect detected; the data do not show that the change moved citation rate, though a small effect may have gone undetected
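
A minimal sketch of the two-proportion z-test, using only the Python standard library. It uses the pooled standard error and assumes the samples are independent; the function names and the example counts are illustrative.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(
    cited_before: int, n_before: int, cited_after: int, n_after: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the change in citation rate."""
    p_before = cited_before / n_before
    p_after = cited_after / n_after
    pooled = (cited_before + cited_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        raise ValueError("no variation in the data; the z-test is undefined")
    z = (p_after - p_before) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def verdict(p_value: float, alpha: float = 0.05, marginal: float = 0.10) -> str:
    """Map a p-value onto the three bands listed above."""
    if p_value < alpha:
        return "significant"
    if p_value <= marginal:
        return "marginal: extend monitoring"
    return "no significant effect detected"


if __name__ == "__main__":
    # Hypothetical counts: 30/150 cited before the change, 51/150 after.
    z, p = two_proportion_z_test(30, 150, 51, 150)
    print(f"z = {z:.2f}, p = {p:.4f}, verdict: {verdict(p)}")
```

With small samples or rates very close to 0% or 100%, the normal approximation behind this test becomes unreliable, and an exact test such as Fisher's is the safer choice.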

What Changes to Test (and in What Order)

Not all AEO changes have the same expected impact on citation rate. Test in order of expected effect size — start with changes that are most likely to produce large, detectable shifts.

High Expected Impact

  • Adding FAQPage schema to a page with no schema markup
  • Restructuring a page to answer-first (answer in first two sentences of every major section)
  • Adding a page to your llms.txt or removing AI crawler blocks from robots.txt
  • Creating a Wikipedia or Wikidata entity for your brand (if you don't have one)
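
As an illustration of the first item, the sketch below builds FAQPage structured data as JSON-LD using the schema.org `FAQPage`, `Question`, and `Answer` types. The helper name `faq_page_jsonld` and the sample question are placeholders; the output would normally be embedded in the page inside a `<script type="application/ld+json">` element.

```python
import json


def faq_page_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder content; use the questions your page actually answers.
    print(faq_page_jsonld([
        ("What is a citation rate?",
         "The fraction of sampled queries in which your brand appears."),
    ]))
```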

Medium Expected Impact

  • Improving schema completeness (adding missing fields to existing schema)
  • Rewriting content to include more specific, citable facts and statistics
  • Adding author credentials and E-E-A-T signals to existing content
  • Building community mentions on Reddit and Quora for target queries

Lower Expected Impact

  • Title and meta description changes
  • Internal linking improvements
  • Image alt text and figure captions
  • Minor content refreshes without structural changes

Efficiency principle

Test high-impact changes first. Once you've validated your highest-leverage interventions, maintain a testing backlog ranked by expected effect size so your testing programme always focuses on the changes most likely to move citation rate.
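
One simple way to keep such a backlog is a list sorted by impact tier. The sketch below assumes the three tiers used in this guide; `BacklogItem`, `next_tests`, and the example entries are illustrative only.

```python
from dataclasses import dataclass

# Lower number = test sooner, following the three tiers in this guide.
IMPACT_ORDER = {"high": 0, "medium": 1, "lower": 2}


@dataclass
class BacklogItem:
    change: str
    impact: str            # "high", "medium", or "lower"
    tested: bool = False


def next_tests(backlog: list[BacklogItem]) -> list[BacklogItem]:
    """Untested items ordered by impact tier; ties keep their original order."""
    pending = [item for item in backlog if not item.tested]
    return sorted(pending, key=lambda item: IMPACT_ORDER[item.impact])


if __name__ == "__main__":
    backlog = [
        BacklogItem("Title and meta description changes", "lower"),
        BacklogItem("Add FAQPage schema", "high", tested=True),
        BacklogItem("Add author credentials", "medium"),
        BacklogItem("Restructure page to answer-first", "high"),
    ]
    for item in next_tests(backlog):
        print(f"[{item.impact}] {item.change}")
```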
