Voice + AR + photo-driven scope predictor: multimodal homeowner scope inference

By Netanel Presman, General Contractor (CSLB #1105249) · Published · 2 min read · Wave 293

Summary

Wave 293A ships a voice + AR + photo-driven scope predictor: speech-to-text and text-to-speech bridges (lib/voice/), plus a photo classifier and AR overlay (lib/scope-predictor/) that infer remodel scope from a single photo and surface permits, code references, and material specs as an overlay. The result is multimodal homeowner intake.

Article body

The fastest scope conversation is the one the homeowner does not have to type. Wave 293A ships three multimodal entry points: voice (the homeowner talks instead of typing), AR overlay (we annotate their photo with the regulatory and code constraints we infer), and photo-driven scope predictor (a single kitchen photo becomes a full scope draft).

lib/voice/whisper-bridge.ts is the speech-to-text adapter. We use OpenAI Whisper for English and the multilingual variants for Spanish (LA + Miami + Phoenix), French (Quebec City), German (Berlin), Italian (Milan, Rome), and Arabic (Dubai). The bridge is a thin wrapper; we did not retrain. lib/voice/tts-bridge.ts is the text-to-speech adapter for outbound responses on the 1-833-ASKBAIL voice line (Wave 294F). Voice mode in chat shows a microphone icon next to the input box; the homeowner clicks, speaks, and we transcribe in-line with confidence scoring. If transcription is uncertain (confidence below 0.85), we render the partial as text and ask the homeowner to confirm.
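The confidence gate described above can be sketched as a small routing function. This is an illustrative shape only; `TranscriptResult` and `routeTranscript` are assumptions for this sketch, not the actual exports of lib/voice/whisper-bridge.ts.

```typescript
// Hypothetical shape of a transcription result from the Whisper bridge.
interface TranscriptResult {
  text: string;
  confidence: number; // 0..1, per-utterance score
  locale: string;     // e.g. "en-US", "es-US", "fr-CA"
}

// The 0.85 floor from the article: at or above it the transcript goes
// straight into chat; below it we render the partial and ask to confirm.
const CONFIDENCE_FLOOR = 0.85;

function routeTranscript(r: TranscriptResult): "accept" | "confirm" {
  return r.confidence >= CONFIDENCE_FLOOR ? "accept" : "confirm";
}
```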

lib/scope-predictor/photo-classifier.ts is the multimodal classifier. The homeowner uploads a single photo of the room they want to remodel; the classifier identifies the room type, fixtures present, structural cues (load-bearing walls, exposed plumbing, ceiling height, window count), age cues (cabinet style, fixture vintage, flooring type), and condition cues (water damage, dated tile, original vs. retrofitted). It returns a structured scope skeleton — "1958 ranch kitchen, 144 sq ft estimate, original cabinets, exposed gas line, no island, west-facing window over sink, single-bowl porcelain sink, vinyl flooring with subfloor exposure visible, likely asbestos sheet flooring under" — that the homeowner can edit before saving.
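A scope skeleton like the one quoted above can be pictured as a typed structure. The field names here are assumptions for illustration, not the real photo-classifier.ts contract; the example values mirror the 1958 ranch kitchen in the article.

```typescript
// Illustrative sketch of the editable scope-skeleton shape.
interface ScopeSkeleton {
  roomType: "kitchen" | "bath" | "adu" | "other";
  areaSqFtEstimate: number;
  yearBuiltEstimate?: number;
  structuralCues: string[];     // load-bearing walls, exposed plumbing, windows
  ageCues: string[];            // cabinet style, fixture vintage, flooring type
  conditionFlags: string[];     // water damage, subfloor exposure, etc.
  inspectionRecommended: boolean; // true when a safety call can't be made from a photo
}

const skeleton: ScopeSkeleton = {
  roomType: "kitchen",
  areaSqFtEstimate: 144,
  yearBuiltEstimate: 1958,
  structuralCues: ["exposed gas line", "west-facing window over sink"],
  ageCues: ["original cabinets", "single-bowl porcelain sink", "vinyl flooring"],
  conditionFlags: ["subfloor exposure visible"],
  inspectionRecommended: true, // likely asbestos sheet flooring under the vinyl
};
```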

lib/scope-predictor/ar-overlay.ts is the regulatory annotator. Once the photo is classified, we overlay the relevant regulatory and code references on the image: an HPOZ icon if the property sits in an LA HPOZ (Wave 292G personalization), a Title 24 envelope reference for the energy upgrade, a Mansionization callout if the addition pushes the property close to the threshold. The overlay is a JSON payload the front-end renders over the image; the back-end never modifies the original photo.
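A minimal sketch of that overlay payload, assuming normalized image coordinates; the type and function names are hypothetical, not the ar-overlay.ts API.

```typescript
// Hypothetical overlay payload: annotations the front-end positions over
// the unmodified photo. Coordinates are normalized to 0..1.
interface OverlayAnnotation {
  id: string;            // e.g. "hpoz", "title-24", "mansionization"
  label: string;
  x: number;             // normalized left edge
  y: number;             // normalized top edge
  severity: "info" | "warning" | "inspection";
}

interface OverlayPayload {
  photoId: string;
  annotations: OverlayAnnotation[];
}

function buildOverlay(photoId: string, anns: OverlayAnnotation[]): OverlayPayload {
  // Reject out-of-range coordinates so the front-end never draws off-image.
  for (const a of anns) {
    if (a.x < 0 || a.x > 1 || a.y < 0 || a.y > 1) {
      throw new Error(`annotation ${a.id} out of bounds`);
    }
  }
  return { photoId, annotations: anns };
}
```

Keeping the overlay as data rather than a rendered image is what lets the back-end leave the original photo untouched.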

The fusion path is multimodal — Claude vision for the room classification, Gemini vision as a second opinion for high-stakes calls (structural, asbestos), and the regulatory data from Wave 292G as the deterministic overlay layer. We never trust a single vision model alone for material safety calls (asbestos under vinyl, lead paint on pre-1978 trim); those flag a "professional inspection recommended" annotation and we do not pretend to know.
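The second-opinion rule can be summarized as a small decision function. Names and categories here are illustrative assumptions; the point is the ordering: material-safety calls are never answered from a photo, and high-stakes calls require both vision models to agree.

```typescript
type Verdict = "confirmed" | "inspection-recommended" | "not-detected";

// High-stakes calls get a second vision model; material-safety calls
// always resolve to a professional-inspection annotation.
const HIGH_STAKES = new Set(["structural", "asbestos"]);
const MATERIAL_SAFETY = new Set(["asbestos", "lead-paint"]);

function fuse(category: string, claudeSaysYes: boolean, geminiSaysYes: boolean): Verdict {
  if (MATERIAL_SAFETY.has(category)) return "inspection-recommended"; // never pretend to know
  if (HIGH_STAKES.has(category)) {
    return claudeSaysYes && geminiSaysYes ? "confirmed" : "inspection-recommended";
  }
  return claudeSaysYes ? "confirmed" : "not-detected"; // low-stakes: single model suffices
}
```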

18 tests cover the photo-classifier shape contract, the AR overlay coordinate math, and the voice-bridge timeout handling. The harder validation is the synthetic-QA suite (Wave 292J), which runs the photo-driven scope on a fixture set of 200 LA, NYC, Miami, Phoenix, and Toronto kitchen + bath + ADU photos and scores the scope skeleton against a human-labeled ground truth. Current accuracy on room-type classification is 0.96; on year-built inference from cabinet/fixture cues it is 0.78; on structural-cue identification (load-bearing wall, exposed plumbing) it is 0.91. We publish those numbers honestly because pretending the model is better than it is would damage homeowner trust.
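A per-field accuracy number like the ones above reduces to exact-match scoring over a labeled fixture set. The fixture shape below is an assumption, not the Wave 292J suite's actual format.

```typescript
// One labeled example: what the model predicted vs. the human label.
interface Fixture<T> {
  predicted: T;
  groundTruth: T;
}

// Fraction of fixtures where the prediction matches the label exactly.
function accuracy<T>(fixtures: Fixture<T>[]): number {
  if (fixtures.length === 0) return 0;
  const correct = fixtures.filter((f) => f.predicted === f.groundTruth).length;
  return correct / fixtures.length;
}
```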

Patent provisional 05 (Wave 294A) covers this multimodal scope predictor design.

Commit attestation

Tests green: 18
Files changed: 6
Lines added: 1,191
Waves: 293
Author: netanel

Commit SHAs are from the AskBaily private repository. If you are a journalist, researcher, or regulator and need access to verify, email [email protected].

Frequently asked

Does the photo-classifier replace a real inspection?
No. The classifier produces a scope skeleton the homeowner edits before saving. For material safety calls (asbestos, lead paint), the overlay annotates "professional inspection recommended" rather than committing to an answer the model cannot verify. We always recommend an inspection before any work involving pre-1978 paint or pre-1990 vinyl flooring.
How accurate is the structural-cue identification?
0.91 on the synthetic-QA fixture set. False negatives (we miss a load-bearing wall the photo shows) are caught at the contractor's first walk-through; false positives (we flag a non-load-bearing wall) merely over-scope the project until the contractor corrects it. We publish accuracy honestly on /transparency.
What languages does voice mode support?
English (US, GB, AU, NZ), Spanish (US, ES, MX), French (CA, FR), German (DE, AT, CH), Italian (IT), Arabic (AE), and Portuguese (BR, PT). Each runs on the OpenAI Whisper multilingual model; we did not retrain. Other languages fall back to text input until the locale is supported.