Why A/B Testing for AI Search is Hard
Traditional A/B testing assumes a deterministic system: show variant A to half your visitors, variant B to the other half, measure conversion rates. The system behaves the same for every user.
AI engines are different. Run the same query twice and you may get different answers. The model's output is influenced by temperature parameters, context from the conversation, the user's location and history, and real-time retrieval variability. A single query run tells you almost nothing about whether your brand is consistently cited.
Querying ChatGPT once and seeing your brand cited does not mean your brand is consistently cited. It means you were cited that one time. Measuring citation rate requires repeated samples of the same query, and the more samples you collect, the narrower the uncertainty around your estimate.
The Statistical Foundation
AI citation tracking uses the same statistical principles as conversion rate optimisation. Your citation rate is a proportion — the fraction of sampled queries in which your brand appears. To get a reliable estimate of that proportion, you need a sufficient sample size.
Wilson Confidence Intervals
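The Wilson score interval is a widely used method for estimating a proportion. It stays well-behaved for small samples and for rates near 0% or 100%, where the simpler normal-approximation interval breaks down. GetCited uses Wilson 95% confidence intervals for all citation rate measurements. A 95% interval does not mean the reported rate is 95% certain to be correct; it means the procedure produces a range that contains the true citation rate in roughly 95% of repeated measurements.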
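As an illustration, the Wilson score interval can be computed with the Python standard library alone. This is a minimal sketch of the textbook formula, not GetCited's implementation; the function name `wilson_interval` and its parameters are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(cited: int, samples: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the Wilson score interval (low, high) for a citation rate.

    cited   -- number of sampled responses that cited the brand
    samples -- total number of sampled responses
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    if not 0 <= cited <= samples:
        raise ValueError("cited must be between 0 and samples")

    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = cited / samples  # observed citation rate

    denom = 1 + z**2 / samples
    centre = (p_hat + z**2 / (2 * samples)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / samples + z**2 / (4 * samples**2)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: 50 citations in 100 sampled runs of one query.
low, high = wilson_interval(50, 100)
print(f"citation rate 50%, 95% CI {low:.1%} to {high:.1%}")
```

Calling `wilson_interval(1, 1)` gives an interval of roughly 21% to 100%, which shows why a single cited run says so little about the underlying rate.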
Required Sample Size
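For a citation rate around 50%, you need approximately 100 samples to get a confidence interval of ±10 percentage points at 95% confidence. Because the variance of a proportion, p(1 − p), is largest at 50%, lower citation rates (10–20%) need fewer samples for the same absolute precision. They need more samples only for the same relative precision: ±10 percentage points around a 20% rate is still a ±50% relative error. The figures below use the normal approximation n ≈ z²·p(1 − p)/E² with z = 1.96, where E is the target half-width.
| Target precision | Sample size needed (50% citation rate) | Sample size needed (20% citation rate) |
|---|---|---|
| ±15 percentage points | 43 samples | 28 samples |
| ±10 percentage points | 97 samples | 62 samples |
| ±5 percentage points | 385 samples | 246 samples |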
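One common way to arrive at sample sizes like these is the normal-approximation formula n ≈ z²·p(1 − p)/E², where p is the expected citation rate and E is the target half-width. The sketch below assumes that formula; the function name `samples_for_margin` is hypothetical, and the Wilson interval will give slightly different widths at small n.

```python
from math import ceil
from statistics import NormalDist


def samples_for_margin(rate: float, margin: float, confidence: float = 0.95) -> int:
    """Samples needed so the CI half-width is at most `margin` (normal approximation).

    rate   -- expected citation rate as a fraction, e.g. 0.5
    margin -- target half-width as a fraction, e.g. 0.10 for ±10 points
    """
    if not 0 < rate < 1:
        raise ValueError("rate must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return ceil(z**2 * rate * (1 - rate) / margin**2)


# Print the required sample size for a few target precisions.
for margin in (0.15, 0.10, 0.05):
    print(f"±{margin:.0%} at a 50% rate: {samples_for_margin(0.5, margin)} samples")
```

At a 50% rate and ±10 points the function returns 97, in line with the approximately 100 samples quoted above. Passing a different `rate` shows how the requirement changes with the expected citation rate.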
In practice, GetCited's Multi-Query Sampling methodology runs each tracked query multiple times per day, accumulating samples rapidly enough to achieve ±10 percentage point precision within 7–14 days of monitoring.
How to Design an AEO A/B Test
1. Establish a Baseline
Before making any changes, monitor your target queries for at least 7 days. Record your citation rate across all engines for the queries you want to improve. This is your pre-treatment baseline.
2. Isolate One Variable
Make exactly one change per test. If you change the page schema, the content structure, and the title simultaneously, you cannot attribute any citation rate change to a specific cause. One change at a time.
3. Set Your Monitoring Period
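Run post-change monitoring for the same number of days as your baseline period. For most citation rate changes, 7–14 days of daily monitoring is sufficient to detect a 10+ percentage point shift, provided each period accumulates enough samples. Comparing two periods takes more data than estimating a single rate: at a baseline near 50%, detecting a 10-point shift with 80% power needs roughly 390 samples in each period.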
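To check whether a monitoring period will collect enough data, a standard power calculation for comparing two proportions can help. The sketch below assumes a two-sided test at the 5% level with 80% power and equal sample counts in both periods; the function name `samples_per_period` and these defaults are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def samples_per_period(baseline: float, shift: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples needed in EACH period to detect `shift` from `baseline`.

    baseline -- pre-change citation rate as a fraction, e.g. 0.5
    shift    -- smallest change worth detecting, e.g. 0.10 for 10 points
    """
    after = baseline + shift
    if not (0 < baseline < 1 and 0 < after < 1):
        raise ValueError("baseline and baseline + shift must lie strictly between 0 and 1")
    if shift == 0:
        raise ValueError("shift must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = baseline * (1 - baseline) + after * (1 - after)
    return ceil((z_alpha + z_power) ** 2 * variance / shift**2)


# Hypothetical scenario: 50% baseline, looking for a 10-point lift.
print(samples_per_period(0.5, 0.10))
```

Dividing the result by the number of samples collected per day gives a rough minimum monitoring period in days.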
4. Evaluate with Statistical Significance
Compare pre-change and post-change citation rates. Use a two-proportion z-test or equivalent to determine whether the change is statistically significant at your chosen confidence level (95% is standard).
- p-value < 0.05: the change is statistically significant at the 5% level (95% confidence)
- p-value 0.05–0.10: marginal evidence; extend your monitoring period to collect more samples
- p-value > 0.10: no statistically significant effect; the data cannot distinguish the change from noise, which is not proof that the change had no effect
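The comparison above can be sketched as a pooled two-proportion z-test using only the Python standard library. The function name `two_proportion_z_test` and the example counts are hypothetical, and the test assumes the samples in each period are independent.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(cited_before: int, n_before: int,
                          cited_after: int, n_after: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for a change in citation rate."""
    if n_before <= 0 or n_after <= 0:
        raise ValueError("both periods need at least one sample")

    p_before = cited_before / n_before
    p_after = cited_after / n_after

    # Pooled rate under the null hypothesis that nothing changed.
    pooled = (cited_before + cited_after) / (n_before + n_after)
    se = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        # Both periods are 0% or both are 100%: no observable difference.
        return 0.0, 1.0

    z = (p_after - p_before) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 30/100 cited before the change, 45/100 after.
z, p_value = two_proportion_z_test(30, 100, 45, 100)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these made-up counts the p-value falls below 0.05, so the 15-point lift would be read as statistically significant under the thresholds above.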
What Changes to Test (and in What Order)
Not all AEO changes have the same expected impact on citation rate. Test in order of expected effect size — start with changes that are most likely to produce large, detectable shifts.
High Expected Impact
- Adding FAQPage schema to a page with no schema markup
- Restructuring a page to answer-first (answer in first two sentences of every major section)
- Adding a page to your llms.txt or removing AI crawler blocks from robots.txt
- Creating a Wikipedia or Wikidata entity for your brand (if you don't have one)
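For the first item in this list, FAQPage markup follows the schema.org structure of a `FAQPage` whose `mainEntity` is a list of `Question` items, each with an `acceptedAnswer`. The sketch below builds that JSON-LD in Python; the helper name `faq_page_jsonld` and the sample question are assumptions for illustration, and real answer text should be escaped appropriately before being embedded in HTML.

```python
import json


def faq_page_jsonld(questions_and_answers):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in questions_and_answers
        ],
    }


# Hypothetical page content, used only to show the output shape.
markup = faq_page_jsonld([
    ("What is a citation rate?",
     "The fraction of sampled queries in which a brand appears."),
])

# JSON-LD is typically embedded in the page inside a script tag.
print('<script type="application/ld+json">')
print(json.dumps(markup, indent=2))
print("</script>")
```

Treat the schema addition as the single variable in the test: publish the markup without touching the visible content, then compare citation rates before and after.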
Medium Expected Impact
- Improving schema completeness (adding missing fields to existing schema)
- Rewriting content to include more specific, citable facts and statistics
- Adding author credentials and E-E-A-T signals to existing content
- Building community mentions on Reddit and Quora for target queries
Lower Expected Impact
- Title and meta description changes
- Internal linking improvements
- Image alt text and figure captions
- Minor content refreshes without structural changes
Test high-impact changes first. Once you've validated your highest-leverage interventions, maintain a testing backlog ranked by expected effect size so your testing programme always focuses on the changes most likely to move citation rate.