Synthetic-QA: 10K nightly LLM stress tests across 7 axes

By Jason, Founder · Published · 2 min read · Waves 292, 294

Summary

Wave 292J ships the Synthetic-QA harness: 10K nightly LLM stress tests across 7 axes (AEO citation, routing, pricing fairness, schema validity, conversion funnel, hallucination, privacy). Wave 294D adds the nightly cron workflow, the Grafana dashboard, 4 alert rules, and the RUNBOOK.

Article body

If we are going to claim that AskBaily's homeowner journey works correctly across every metro, every locale, every regulatory regime, and every contractor cohort, we have to test it under load — not against fixtures, but against generated personas that probe the edge cases. Wave 292J is the harness: a synthetic homeowner generator that produces ~10,000 nightly journeys, each scored on seven axes.

The personas come from lib/synthetic-qa/persona-generator.ts. Eight project types crossed with four value bands crossed with three urgency tiers gives 96 base personas; each persona is then localized across the 16 covered metros (LA, NYC, SF, Miami, Phoenix, Chicago, Austin, Seattle, Toronto, London, Sydney, Melbourne, Singapore, Auckland, Dubai, Boston) for 1,536 base journeys, plus locale variants (en-US, es-US, en-GB, en-AU, en-NZ, en-SG, ar-AE, fr-CA) and edge personas (post-fire LA, hurricane-prep Miami, post-quake retrofit SF). The full set is 10,000 journeys per night.
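To make the combinatorics concrete, here is a minimal sketch of how that cross product could be laid out. It is illustrative only: the specific project types, value bands, and urgency labels below are assumptions (the post does not name them), and the function names are not the actual lib/synthetic-qa/persona-generator.ts exports.

```ts
// Illustrative sketch; the real persona-generator.ts API and label names may differ.
type ProjectType = 'kitchen' | 'bath' | 'roof' | 'adu' | 'solar' | 'hvac' | 'landscape' | 'foundation'; // assumed names
type ValueBand = 'under25k' | '25to75k' | '75to250k' | 'over250k';   // assumed names, 4 bands
type Urgency = 'emergency' | 'this-quarter' | 'exploratory';         // assumed names, 3 tiers

interface BasePersona {
  projectType: ProjectType;
  valueBand: ValueBand;
  urgency: Urgency;
}

const PROJECT_TYPES: ProjectType[] = ['kitchen', 'bath', 'roof', 'adu', 'solar', 'hvac', 'landscape', 'foundation'];
const VALUE_BANDS: ValueBand[] = ['under25k', '25to75k', '75to250k', 'over250k'];
const URGENCIES: Urgency[] = ['emergency', 'this-quarter', 'exploratory'];
const METROS = ['LA', 'NYC', 'SF', 'Miami', 'Phoenix', 'Chicago', 'Austin', 'Seattle',
  'Toronto', 'London', 'Sydney', 'Melbourne', 'Singapore', 'Auckland', 'Dubai', 'Boston'];

// 8 project types x 4 value bands x 3 urgency tiers = 96 base personas.
function basePersonas(): BasePersona[] {
  return PROJECT_TYPES.flatMap(projectType =>
    VALUE_BANDS.flatMap(valueBand =>
      URGENCIES.map(urgency => ({ projectType, valueBand, urgency }))));
}

// Crossed with 16 metros = 1,536 base journeys (locale variants and edge personas are added on top).
const baseJourneys = basePersonas().flatMap(p => METROS.map(metro => ({ ...p, metro })));
console.log(baseJourneys.length); // 1536
```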

Each journey runs end-to-end: chat opener, scope refinement, contractor match, quote review, project completion, review submission. The synthetic homeowner is a deterministic LLM-driven persona so the same seed produces the same journey across runs — you can diff regressions against a baseline. The scoring lives in lib/synthetic-qa/assertion-engine.ts and runs over seven axes:
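The post does not say how the seeding is wired. As an illustration only, one common pattern is to hash a stable persona id into a numeric seed and drive every stochastic choice (plus temperature-0 LLM calls) from a seeded PRNG, so tonight's transcript can be diffed line-for-line against last night's. The mulberry32 generator and FNV-1a hash below are stand-ins, not necessarily what the harness uses.

```ts
// Sketch only: a seeded PRNG so the same persona id replays the same journey across runs.
function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Derive a stable numeric seed from a persona id such as "kitchen/25to75k/emergency/LA/en-US".
function seedFromId(personaId: string): number {
  let h = 2166136261; // FNV-1a offset basis
  for (let i = 0; i < personaId.length; i++) {
    h ^= personaId.charCodeAt(i);
    h = Math.imul(h, 16777619); // FNV prime
  }
  return h >>> 0;
}

// Every branch in the journey draws from this rng, so the run is reproducible and diffable.
const rng = mulberry32(seedFromId('kitchen/25to75k/emergency/LA/en-US'));
```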

1. AEO citation: did the chat output match what AI engines would cite from our content?
2. Routing: did Lane-Classifier put the persona in the correct lane?
3. Pricing fairness: did the cost estimate fall within the published range for the metro + scope?
4. Schema validity: did every page in the journey emit valid Schema.org?
5. Conversion funnel: did the persona reach the next stage at expected rates?
6. Hallucination: did the agents make any unsourced factual claim?
7. Privacy: did any agent leak PII outside its assigned scope?

A journey passes only if all seven axes score at least 95 percent; if any axis drops below 95, the journey fails. A regression is flagged when a metric drops more than 2 percent against the prior night's baseline (lib/synthetic-qa/regression-detector.ts).
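The pass and regression rules reduce to two small predicates. This is a sketch of that logic, not the actual assertion-engine or regression-detector exports; the axis keys and constant names are illustrative, and the 2 percent threshold is read here as points on a 0-100 scale.

```ts
// Sketch of the pass/fail and regression rules; names are illustrative, not the repo's API.
type Axis = 'aeoCitation' | 'routing' | 'pricingFairness' | 'schemaValidity'
          | 'conversionFunnel' | 'hallucination' | 'privacy';

type AxisScores = Record<Axis, number>; // each axis scored 0-100

const PASS_THRESHOLD = 95;   // every axis must be >= 95 for the journey to pass
const REGRESSION_DELTA = 2;  // flag if an axis drops more than 2 vs. the prior night's baseline

function journeyPasses(scores: AxisScores): boolean {
  return Object.values(scores).every(score => score >= PASS_THRESHOLD);
}

function detectRegressions(tonight: AxisScores, baseline: AxisScores): Axis[] {
  return (Object.keys(tonight) as Axis[])
    .filter(axis => baseline[axis] - tonight[axis] > REGRESSION_DELTA);
}
```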

Wave 294D took the harness to production. scripts/synthetic-qa-nightly.mjs is the runnable entry point. .github/workflows/synthetic-qa-nightly.yml runs the cron at 02:00 PT, concurrency-grouped, with artifact upload and a Slack notification on failure. ops/grafana/synthetic-qa-dashboard.json defines six panels: pass-rate timeseries, regression bar, latency, top-5 failing personas, cost line, and divergence gauge. ops/grafana/synthetic-qa-alerts.yaml defines four alerts: PassRateBelow95, CriticalAxisDown, RunMissed, and CostBudgetExceeded. The runbook at docs/RUNBOOK-SYNTHETIC-QA.md runs about 1,600 words covering the daily/weekly cadence, alert response, and how to add a new persona.
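Roughly, the entry point ties those pieces together as below. This is a sketch of the orchestration only; every helper named here is an assumption, not the actual surface of scripts/synthetic-qa-nightly.mjs.

```ts
// Sketch of the nightly orchestration; all helpers below are assumed, declared only so the sketch type-checks.
declare function generateNightlyPersonas(): { id: string }[];
declare function runJourney(persona: { id: string }): Promise<Record<string, number>>;
declare function loadBaseline(): Promise<Record<string, Record<string, number>>>;
declare function journeyPasses(scores: Record<string, number>): boolean;
declare function detectRegressions(scores: Record<string, number>, baseline: Record<string, number>): string[];
declare function publishReport(results: unknown[]): Promise<void>;

async function nightlyRun(): Promise<void> {
  const personas = generateNightlyPersonas();   // ~10,000 journeys (see the persona sketch above)
  const baseline = await loadBaseline();        // prior night's per-axis scores, keyed by persona id

  const results: Array<{
    personaId: string;
    scores: Record<string, number>;
    passed: boolean;
    regressions: string[];
  }> = [];

  for (const persona of personas) {
    const scores = await runJourney(persona);   // end-to-end journey, scored on the seven axes
    results.push({
      personaId: persona.id,
      scores,
      passed: journeyPasses(scores),
      regressions: detectRegressions(scores, baseline[persona.id] ?? scores),
    });
  }

  await publishReport(results);                 // artifact upload, Grafana annotations, Slack on failure
}
```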

Cost: at ~$0.005 per journey, 10K nightly is $50/night, $1,500/month. We budget $200/day under the canary cost governor (Wave 9.6) and the alert fires if we exceed it. The infrastructure is the same one we use for production agent calls; we did not build a separate synthetic stack.
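The budget math is simple enough to check inline; the constants below restate the figures from this post, and the guard mirrors what the CostBudgetExceeded alert watches for.

```ts
// Budget sketch using the figures quoted above; constant names are illustrative.
const COST_PER_JOURNEY_USD = 0.005;
const JOURNEYS_PER_NIGHT = 10_000;
const DAILY_BUDGET_USD = 200;

const nightlyCost = COST_PER_JOURNEY_USD * JOURNEYS_PER_NIGHT;  // $50 per night
const monthlyCost = nightlyCost * 30;                           // $1,500 per month
const overBudget = nightlyCost > DAILY_BUDGET_USD;              // false; a true here is what trips CostBudgetExceeded
```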

Sources & references

Commit attestation

Tests green: 9
Files changed: 8
Lines added: 1,553
Waves: 292, 294
Author: jason

Commit SHAs are from the AskBaily private repository. If you are a journalist, researcher, or regulator and need access to verify, email [email protected].

Frequently asked

How is synthetic-QA different from unit tests?
Unit tests exercise pure-function code paths. Synthetic-QA runs full end-to-end homeowner journeys with LLM-driven personas across all metros, locales, and edge cases nightly, scoring each journey on 7 production axes. It catches regressions that unit tests cannot: voice drift, routing bias, AEO citation decay, hallucination creep.
What happens when the suite finds a regression?
If any axis drops more than 2% against the prior baseline, the regression-detector emits a Grafana annotation, and the pass-rate alert fires if the regression persists for two consecutive nights. Engineering triages within 24 hours, and the synthetic journey that caught the regression is pinned as a regression test.
Are the synthetic personas based on real homeowners?
No. They are deterministically generated from a seed schema — 8 project types × 4 value bands × 3 urgencies × 16 metros × 8 locales. No real PII is involved. The personas are designed to cover the edge cases real homeowners present, not to mirror specific individuals.