Kairo AI Receptionist Engineering Teardown: Multi-Channel Architecture

Small service businesses (clinics, salons, law firms, real estate agencies) lose 30–45% of inbound leads to missed calls and after-hours inquiries. Generic chatbots make the problem worse — irrelevant responses, no booking authority, no sense of when to escalate. Kairo solves this with a single AI agent that handles WhatsApp, website chat, and voice from one reasoning engine.

This is the engineering teardown of Kairo AI Receptionist — the multi-channel adapter pattern, why we run a sub-80ms intent classifier before the main LLM, and the escalation logic that prevents AI hallucinations on appointment booking.

The Problem

Service businesses with 1–10 employees can't afford a full-time receptionist (₹18K–₹35K/month, doesn't cover weekends), but they desperately need 24/7 coverage. The customer flow is identical across industries:

1. Customer asks a basic question (hours, pricing, services)
2. Customer asks a specific question (do you treat this condition / handle this case?)
3. Customer wants to book an appointment
4. Customer has an unusual situation requiring human judgment

Off-the-shelf chatbots fail at steps 2–4. Pre-trained LLMs without industry context hallucinate at step 2 and over-promise at step 3. Generic intent classifiers escalate too aggressively at step 4, defeating the cost savings.

The Architecture

Kairo is a multi-layer AI system with six components:

Fast intent classifier (< 80ms) — A small fine-tuned model that classifies inbound messages into categories before the main LLM gets involved
Industry-specific knowledge base — pgvector embeddings of operator-uploaded FAQs, plus an industry seed corpus
Main reasoning engine — GPT-4o with channel-aware system prompts
Appointment booking layer — Integrates with Google Calendar / Calendly to fetch real-time availability and confirm bookings
Escalation engine — Monitors confidence per turn; triggers warm handoff with full context when threshold drops
Channel adapters: single agent core, three channel adapters (WhatsApp via Twilio, web chat via custom widget, voice via Twilio Voice + Deepgram STT + ElevenLabs TTS)

The data layer is Supabase. The application is hosted on Cloudflare Pages, with edge functions serving the sub-80ms intent path.

Key Technical Decisions

Why a Pre-LLM Intent Classifier

Sending every message to GPT-4o sounds simpler. It's also slow (700–1500ms per turn) and expensive (~₹0.5–2 per message). For service businesses doing 100+ inbound interactions per day, the cost compounds.

The intent classifier is a small fine-tuned model that handles 80% of messages without involving the main LLM:

Greeting? Reply with a templated greeting.
Hours/location query? Reply from a small pre-rendered FAQ.
Pricing? Pull from a structured price list.
Anything ambiguous? Hand off to GPT-4o.
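The routing above can be sketched roughly as follows. This is a minimal illustration rather than Kairo's implementation: a keyword stub stands in for the fine-tuned classifier, and the names `classify_intent`, `route`, `GREETING`, `FAQ`, `PRICE_LIST`, and `call_main_llm` are all hypothetical.

```python
from typing import Callable

# Hypothetical canned data; a real deployment would load these per operator.
GREETING = "Hi! How can we help you today?"
FAQ = {"hours": "We are open 9am-6pm, Monday to Saturday."}
PRICE_LIST = {"consultation": "₹500"}


def classify_intent(message: str) -> str:
    """Keyword stub standing in for the small fine-tuned classifier."""
    text = message.lower().strip()
    if text in {"hi", "hello", "hey"}:
        return "greeting"
    if "open" in text or "hours" in text or "where" in text:
        return "hours_location"
    if "price" in text or "cost" in text or "how much" in text:
        return "pricing"
    return "ambiguous"


def route(message: str, call_main_llm: Callable[[str], str]) -> str:
    """Answer from templates when possible; otherwise defer to the main LLM."""
    intent = classify_intent(message)
    if intent == "greeting":
        return GREETING
    if intent == "hours_location":
        return FAQ["hours"]
    if intent == "pricing":
        return "\n".join(f"{item}: {price}" for item, price in PRICE_LIST.items())
    # Anything the classifier cannot place goes to the expensive path.
    return call_main_llm(message)
```

Passing the LLM call in as a function keeps the cheap paths testable without any network access.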

This drops average response time from ~1100ms to ~250ms and reduces LLM costs by ~70%. The quality stays high because the templated paths handle the high-volume simple cases that don't benefit from LLM reasoning anyway.
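As a rough sanity check on those latency figures, the blended average can be computed from the fast-path share. The 80% share and the ~1100ms LLM turn come from the text above; the 40ms fast-path latency is an assumed value inside the 80ms budget, and `blended_latency_ms` is a hypothetical helper.

```python
def blended_latency_ms(fast_share: float, fast_ms: float, llm_ms: float) -> float:
    """Average latency when a share of traffic takes the fast path."""
    return fast_share * fast_ms + (1.0 - fast_share) * llm_ms


# fast_ms=40.0 is an assumption; the article only states "under 80ms".
average = blended_latency_ms(fast_share=0.8, fast_ms=40.0, llm_ms=1100.0)
print(round(average))  # 252, in line with the ~250ms quoted above
```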

For the broader pattern of routing different sub-tasks to different models, see our ChatGPT vs Claude vs Custom LLM decision framework.

pgvector for Industry-Specific Knowledge

Every industry has its own vocabulary. A dental clinic's FAQs use words like "implant", "crown", "RCT". A real estate agency uses "carpet area", "RERA", "ready possession". A generic LLM trained on the open web doesn't know which sense of "RCT" to use in a dental context.

We solved this with pgvector. Operators upload their own FAQs (text, PDF, or web URL). We embed them with OpenAI's text-embedding-3-large and store them in pgvector. On every inbound message, we retrieve the top-K most similar FAQ chunks and inject them into the system prompt before the main LLM runs.
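The retrieve-then-inject step can be sketched in memory as below. This is an illustration under stated assumptions, not the production code: toy 3-dimensional vectors stand in for real embeddings, and `top_k`, `build_system_prompt`, and the `faq_chunks` table named in the comment are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# With pgvector the same ranking is typically a single query, for example:
#   SELECT content FROM faq_chunks ORDER BY embedding <=> %s LIMIT %s;
# where <=> is pgvector's cosine-distance operator.


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_system_prompt(base_prompt: str, retrieved: List[Tuple[float, str]]) -> str:
    """Inject the retrieved FAQ chunks into the system prompt."""
    context = "\n".join(f"- {text}" for _, text in retrieved)
    return f"{base_prompt}\n\nAnswer only from these operator FAQs:\n{context}"


# Toy data: three FAQ chunks with made-up embeddings.
CHUNKS = [
    ("RCT (root canal treatment) takes 1-2 visits.", [0.9, 0.1, 0.0]),
    ("We are open 9am-6pm, Monday to Saturday.", [0.0, 0.2, 0.9]),
    ("Crowns are fitted about a week after an RCT.", [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], CHUNKS, k=2)
prompt = build_system_prompt("You are a dental clinic receptionist.", results)
```

The top similarity score returned here is also a natural input for a retrieval-confidence signal.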

The result: the LLM answers from the operator's own knowledge, in the operator's vocabulary, which sharply reduces hallucination. The escalation engine catches the cases where retrieval confidence is low.

Channel Adapters Around a Single Reasoning Core

WhatsApp, web chat, and voice are three different channels with three different constraints. WhatsApp messages are async; web chat is synchronous; voice is real-time streaming. The naive approach is three separate agents.

We built one core reasoning engine and three thin channel adapters:

WhatsApp adapter: receives Twilio webhook → preprocesses for WhatsApp formatting (no markdown) → calls core → posts response via Twilio
Web chat adapter: WebSocket-based; preserves session history client-side; calls core directly with conversation context
Voice adapter: Twilio Voice → Deepgram STT (streaming) → core → ElevenLabs TTS → back to Twilio

The core handles intent classification, knowledge retrieval, the LLM call, and escalation. Channel adapters handle only their own media format.

This pattern means a new channel (say, Instagram DMs) takes 2–3 days, not 2–3 weeks.
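A minimal sketch of the adapter pattern, assuming a single `handle` method per channel. The names `core_respond`, `CoreReply`, and `ChannelAdapter` are hypothetical, the core is a stub, and the payload keys are illustrative (`Body` is the field Twilio uses for inbound message text).

```python
import re
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class CoreReply:
    text: str


def core_respond(message: str, history: List[str]) -> CoreReply:
    """Stub for the shared core (classification, retrieval, LLM, escalation)."""
    return CoreReply(text=f"**Thanks!** You said: {message}")


class ChannelAdapter(Protocol):
    def handle(self, raw_payload: dict) -> str: ...


class WhatsAppAdapter:
    """Strips markdown characters, since WhatsApp replies are sent without markdown."""

    def handle(self, raw_payload: dict) -> str:
        reply = core_respond(raw_payload["Body"], history=[])
        return re.sub(r"[*_`#]", "", reply.text)


class WebChatAdapter:
    """Receives the client-held history alongside the new message."""

    def handle(self, raw_payload: dict) -> str:
        reply = core_respond(raw_payload["message"], raw_payload.get("history", []))
        return reply.text
```

Under this shape, a new channel is one more class with a `handle` method; nothing in the core changes.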

For a deeper look at how to design agent systems generally, see How to build an AI agent and Agentic AI Systems.

Confidence-Based Escalation

The hardest design problem in this system was escalation. Escalate too eagerly and the AI value disappears. Don't escalate enough and the AI confidently gives bad answers.

Our approach: every turn outputs both an answer and a confidence score (0–1) based on:

Knowledge-base retrieval similarity (was relevant content found?)
LLM token-level probability variance (was the model confident?)
Conversation drift detection (is the user repeating themselves with frustration?)

When the running confidence drops below the operator's threshold, escalation triggers — the AI sends a graceful handoff message ("Let me get a team member to help — they'll be with you in a few minutes") and notifies the human team via WhatsApp/Slack with the full conversation context summary.
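One way to combine the three signals into a running score is sketched below. The article does not state how the signals are weighted or smoothed, so the weights, the `smoothing` factor, and the names `TurnSignals`, `turn_confidence`, and `EscalationMonitor` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnSignals:
    retrieval_similarity: float  # 0-1, best FAQ match for this turn
    token_confidence: float      # 0-1, higher means less probability variance
    drift_penalty: float         # 0-1, higher means more repetition or frustration


def turn_confidence(s: TurnSignals) -> float:
    """Weighted blend of the three signals, clamped to [0, 1]."""
    # Hypothetical weights; tune against real conversations.
    score = (0.5 * s.retrieval_similarity
             + 0.3 * s.token_confidence
             + 0.2 * (1.0 - s.drift_penalty))
    return max(0.0, min(1.0, score))


@dataclass
class EscalationMonitor:
    threshold: float
    smoothing: float = 0.5  # assumed weight given to the newest turn
    running: float = 1.0
    history: List[float] = field(default_factory=list)

    def observe(self, signals: TurnSignals) -> bool:
        """Update the running confidence; return True when a handoff should trigger."""
        current = turn_confidence(signals)
        self.running = self.smoothing * current + (1.0 - self.smoothing) * self.running
        self.history.append(self.running)
        return self.running < self.threshold
```

Smoothing across turns means a single weak answer does not trigger a handoff by itself, while a run of weak answers does.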

Operators can tune the threshold per industry. Because escalation fires when confidence falls below the threshold, a dental clinic might set a high threshold so the agent hands off sooner (be cautious); a real estate inquiry desk might set a lower one (handle more autonomously).

No Hallucinations on Appointment Booking

This is the strictest safety requirement: the AI must never confirm an appointment for a slot that isn't actually available.

We handle this with hard tool-call constraints. The agent can't generate appointment confirmations as text; it can only call the book_appointment(time_slot) tool. That tool first verifies availability via the Google Calendar API. If the slot is unavailable, it returns "slot not available" and the agent has to ask for a different time. If the slot is available, it books it and returns a confirmation.

The agent literally cannot lie about availability — the tool layer doesn't permit it.
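The tool-layer constraint might look like the sketch below. Only the `book_appointment(time_slot)` name comes from the text above; the extra `calendar` parameter, the `CalendarClient` interface, `BookingResult`, and the in-memory calendar are hypothetical stand-ins for a real Google Calendar integration.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class CalendarClient(Protocol):
    """Hypothetical wrapper; a real one would call the calendar provider's API."""

    def is_free(self, time_slot: str) -> bool: ...
    def create_event(self, time_slot: str) -> str: ...


@dataclass
class BookingResult:
    status: str                     # "confirmed" or "slot_not_available"
    event_id: Optional[str] = None


def book_appointment(time_slot: str, calendar: CalendarClient) -> BookingResult:
    """Check availability first; book only if the slot is free."""
    if not calendar.is_free(time_slot):
        return BookingResult(status="slot_not_available")
    return BookingResult(status="confirmed", event_id=calendar.create_event(time_slot))


def confirmation_text(result: BookingResult, time_slot: str) -> str:
    """The only place confirmation wording is produced, keyed off the tool result."""
    if result.status == "confirmed":
        return f"You're booked for {time_slot}."
    return "That slot isn't available. Could you suggest another time?"


class InMemoryCalendar:
    """Toy calendar used to exercise the tool without network access."""

    def __init__(self) -> None:
        self.booked: set = set()

    def is_free(self, time_slot: str) -> bool:
        return time_slot not in self.booked

    def create_event(self, time_slot: str) -> str:
        self.booked.add(time_slot)
        return f"evt-{len(self.booked)}"
```

Because the confirmation message is rendered from the tool result rather than from model output, the wording the customer sees always reflects what the calendar actually returned.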

Why This Matters for AI Product Builders

Kairo is the template for vertical AI agents in service industries. The pattern:

Multi-layer architecture with cheap intent classification before expensive LLM calls
Industry-specific knowledge base via vector retrieval
Channel adapters around a shared reasoning core
Tool-layer safety constraints, not prompt-based safety

We expect this to be the dominant architecture for service-business AI agents through 2027. For India-specific use cases that fit this pattern, see AI Automation for Indian SMBs.

What We'd Do Differently

Use Anthropic Claude over GPT-4o for the main reasoning. As of 2026, Claude has stronger tool-use reliability, so we'd choose it for a new build.
Build the operator dashboard with stricter guardrails on FAQ uploads. Some operators upload PDFs with conflicting answers, and the agent gets confused.
Ship with Hindi/regional language support from day one. We added it later; shipping it at v1 would have had more impact at launch.

Where Nexolve Fits

We build vertical AI agents and customer-facing automation via our AI-Powered Automation service. For the full project context, see the Kairo AI Receptionist case study.
