Full Technical Walkthrough

How Does an AI Interview Copilot Work?

The audio capture pipeline, Whisper transcription, question classification, RAG from your resume, SSE streaming, and Picture-in-Picture rendering — explained for ML-literate readers.


Session Lifecycle

From session start to answer on screen — step by step

Upload resume and job description
Before the interview, upload your resume (PDF or DOCX) and paste the job description into OphyAI. The system chunks both documents and embeds them into a vector index that will be used for retrieval during the interview.
Enable audio capture
OphyAI requests microphone access (getUserMedia) and system audio access (getDisplayMedia with audio:true). Both streams are captured, mixed into a single stream, and downsampled to 16kHz mono PCM.
Start real-time transcription
The audio stream is chunked into 1–3 second segments and sent to the transcription service (Whisper-architecture model). The rolling transcript buffer is continuously updated as the interview proceeds.
Question detected and classified
A lightweight classifier (fine-tuned BERT/DeBERTa) monitors the transcript for interview questions. When detected, it classifies the question type (behavioral, technical, situational) and triggers the retrieval step.
Retrieval-augmented generation
The system retrieves semantically relevant chunks from the pre-indexed resume and JD vector store. These chunks are injected into the LLM prompt as context, and the model generates a STAR-structured answer specific to the candidate.
Answer streamed to Whisper Mode overlay
The LLM response is streamed via SSE or WebSockets. Tokens appear progressively in the Document Picture-in-Picture overlay window — which is designed for private notes during normal screen-share workflows — within 1.5–3 seconds of the question ending.

Deep Dive

The six-stage technical pipeline

Stage 1

Audio Capture Pipeline

Two browser APIs capture audio simultaneously: getUserMedia() for the microphone and getDisplayMedia() with audio:true for system or tab audio, which carries the interviewer's voice from the video call (what can be captured varies by browser and operating system). Both streams are mixed client-side using the Web Audio API into a single stream and downsampled to 16kHz mono PCM, the format Whisper-architecture models expect.

getUserMedia() + getDisplayMedia(audio:true) → Web Audio API mix → 16kHz PCM
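
The arithmetic behind the mix and downsample step can be sketched outside the browser. The following is a minimal illustration in Python, not the Web Audio API code a real client would run: it assumes both inputs are mono float samples at 48kHz, and the function names (mix_mono, downsample, to_pcm16) are hypothetical. The box-filter decimation is a simplification; a production resampler would apply a proper low-pass filter.

```python
from array import array


def mix_mono(mic: list[float], system: list[float]) -> list[float]:
    """Average two mono signals sample by sample; the shorter one is zero-padded."""
    n = max(len(mic), len(system))
    mic = mic + [0.0] * (n - len(mic))
    system = system + [0.0] * (n - len(system))
    return [(a + b) / 2.0 for a, b in zip(mic, system)]


def downsample(samples: list[float], src_rate: int = 48_000, dst_rate: int = 16_000) -> list[float]:
    """Box-filter decimation: average each group of src_rate // dst_rate samples."""
    if src_rate % dst_rate != 0:
        raise ValueError("this sketch only handles integer rate ratios")
    step = src_rate // dst_rate
    usable = len(samples) - len(samples) % step  # drop a trailing partial group
    return [sum(samples[i:i + step]) / step for i in range(0, usable, step)]


def to_pcm16(samples: list[float]) -> bytes:
    """Clamp floats to [-1, 1] and pack them as signed 16-bit PCM (native byte order)."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return array("h", (int(round(s * 32767)) for s in clamped)).tobytes()
```

Averaging rather than summing the two sources keeps the mixed signal inside the [-1, 1] range, so clipping only occurs if an input was already out of range.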

Stage 2

Real-Time Transcription

The PCM audio stream is chunked into 1–3 second segments and sent over a WebSocket to the transcription service. Server-side Whisper Large v3 (or a fine-tuned derivative) processes each chunk and returns a partial transcript in 300–600ms. The rolling transcript buffer is updated continuously. Speaker diarization — distinguishing the interviewer's voice from the candidate's — uses a lightweight pyannote.audio model to label turns.

Audio chunks → WebSocket → Whisper v3 → rolling transcript buffer
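
As a rough sketch of the chunking and rolling-buffer logic described above: the code below slices a 16kHz, 16-bit PCM byte string into fixed-duration segments and keeps a bounded transcript. The 2-second default, the 400-word cap, and the names chunk_pcm and RollingTranscript are illustrative assumptions; the WebSocket transport and the Whisper call are deliberately left out.

```python
from collections import deque
from typing import Iterator

SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM


def chunk_pcm(pcm: bytes, seconds: float = 2.0) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of PCM; the final slice may be shorter."""
    size = int(SAMPLE_RATE * seconds) * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


class RollingTranscript:
    """Keeps only the most recent max_words words of the transcript."""

    def __init__(self, max_words: int = 400) -> None:
        self._words: deque[str] = deque(maxlen=max_words)

    def append(self, partial: str) -> None:
        # Older words fall off the left end once the cap is reached.
        self._words.extend(partial.split())

    def text(self) -> str:
        return " ".join(self._words)
```

A bounded buffer like this gives the downstream classifier a stable amount of recent context, whatever the length of the interview.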

Stage 3

Question Classification

A fine-tuned DeBERTa classifier (≈150M parameters) monitors the rolling transcript for interview question signals: interrogative phrasing, turn-ending silence detected by VAD (Voice Activity Detection), and question-type keywords. When a question boundary is detected, the classifier assigns a category — behavioral, technical, situational, competency, or case/framework — that will guide downstream prompt construction. Inference runs in under 50ms on a T4 GPU.

Transcript → VAD boundary → DeBERTa classifier → question category
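
To make the decision logic concrete, here is a rule-based stand-in for the classifier. It is not a DeBERTa model and not OphyAI's implementation: the opener list, the keyword table, the 700ms silence threshold, and the "unknown" fallback are all hypothetical, chosen only to show how a silence signal and phrasing cues might be combined before a category is assigned.

```python
INTERROGATIVE_OPENERS = (
    "what", "why", "how", "when", "where", "who", "which",
    "can you", "could you", "would you", "tell me", "describe", "walk me through",
)

# Hypothetical keyword table; a trained classifier would learn these signals instead.
CATEGORY_KEYWORDS = {
    "behavioral": ("tell me about a time", "describe a situation", "give me an example"),
    "situational": ("what would you do", "how would you handle", "imagine"),
    "technical": ("complexity", "implement", "algorithm", "design a", "difference between"),
}


def looks_like_question(utterance: str) -> bool:
    """True if the text ends with '?' or starts with a common interrogative opener."""
    text = utterance.strip().lower()
    return text.endswith("?") or text.startswith(INTERROGATIVE_OPENERS)


def is_question_boundary(utterance: str, trailing_silence_ms: int, min_silence_ms: int = 700) -> bool:
    """Require both enough trailing silence and question-like phrasing."""
    return trailing_silence_ms >= min_silence_ms and looks_like_question(utterance)


def categorize(utterance: str) -> str:
    """Return the first category whose keyword appears in the text, else 'unknown'."""
    text = utterance.lower()
    for category, phrases in CATEGORY_KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return category
    return "unknown"
```

Prefix matching like this produces false positives (for example, "whatever" starts with "what"), which is one reason a learned classifier is used in practice.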

Stage 4

Retrieval-Augmented Generation (RAG)

At session setup, the user's resume and JD are chunked (512-token chunks, 64-token overlap) and embedded using text-embedding-3-small. Embeddings are stored in a per-session vector index. When a question arrives, the question text is embedded and a top-k=5 similarity search retrieves the most relevant resume chunks — specific projects, metrics, job titles, technologies. These chunks are injected into the generation prompt as grounding context.

Question embed → cosine similarity → top-5 resume chunks → prompt context
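
The two mechanical pieces of this stage, overlapping chunks and cosine top-k search, can be sketched in a few lines. This is a minimal illustration under stated assumptions: tokens are plain strings rather than real tokenizer output, embeddings are passed in as lists of floats rather than fetched from text-embedding-3-small, and the index is a plain list rather than a vector database. The names chunk_tokens, cosine, and top_k are hypothetical.

```python
import math


def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Sliding window: each chunk starts size - overlap tokens after the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this chunk already reaches the end
            break
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int = 5) -> list[tuple[float, str]]:
    """Score every (text, vector) entry against the query and keep the k best."""
    scored = [(cosine(query, vec), text) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

A brute-force scan is adequate here because a resume and job description yield only a handful of chunks; approximate nearest-neighbor indexes matter at much larger scales.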

Stage 5

Streaming Answer Generation

The assembled prompt (system instructions, question type, retrieved resume context, job description summary, and the raw question) is sent to a hosted LLM (GPT-4o, Claude 3.5 Sonnet, or a fine-tuned Llama-3 variant depending on implementation). Generation is requested with streaming enabled; the OpenAI and Anthropic APIs both expose this as a stream: true request parameter. The first token arrives in 200–400ms. Tokens stream via Server-Sent Events back to the client, which appends them to the overlay in real time. STAR structure (Situation, Task, Action, Result) is enforced via the prompt.

Prompt → LLM (streaming) → SSE → progressive token render
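
The prompt assembly and the SSE framing can both be illustrated without calling any model. The sketch below assumes a hypothetical prompt template (build_prompt) and shows a small parser (parse_sse) for the Server-Sent Events wire format, where each event is one or more "data:" lines terminated by a blank line. Other SSE fields (event, id, retry) are ignored here for brevity.

```python
from typing import Iterable, Iterator


def build_prompt(question: str, category: str, resume_chunks: list[str], jd_summary: str) -> str:
    """Fill a simple template; the wording is illustrative, not a production prompt."""
    context = "\n".join(f"- {chunk}" for chunk in resume_chunks)
    return (
        "Answer as the candidate using the STAR structure (Situation, Task, Action, Result).\n"
        f"Question type: {category}\n"
        f"Resume context:\n{context}\n"
        f"Job description summary: {jd_summary}\n"
        f"Question: {question}"
    )


def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event from an iterable of SSE text lines."""
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                  # a blank line dispatches the buffered event
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith(":"):      # comment line, often used as a keep-alive
            continue
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the payload.
            data.append(value[1:] if value.startswith(" ") else value)
    # An event with no terminating blank line is discarded, as in the SSE format.
```

On the client, each yielded payload would be appended to the overlay as it arrives, which is what produces the progressive rendering.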

Stage 6

Document Picture-in-Picture Rendering

The streamed tokens are rendered into a Document Picture-in-Picture window (documentPictureInPicture.requestWindow()). This is a separate, always-on-top browser window with its own document rather than part of the page. It is displayed on the user's screen but is not included when they share a single tab or application window in Zoom, Teams, or Meet; sharing the entire screen does capture it. The PiP window is styled as a compact overlay: bullet-pointed answer structure, dark background, high-contrast text for fast scanning.

documentPictureInPicture.requestWindow() → separate always-on-top window → private overlay

Performance Engineering

The latency budget — why every millisecond matters

A candidate naturally pauses for 3–5 seconds before answering a difficult question. That is the total budget for the copilot pipeline. Every stage has to fit within that window, or the suggestion arrives too late to be useful.

Pipeline Stage	Latency	Notes
Audio capture + chunking	~0ms	Streaming — no accumulation delay
Transcription (1–3s audio chunk)	300–600ms	Whisper-architecture model, server-side
Question classification (BERT/DeBERTa)	~50ms	Lightweight fine-tuned classifier
Vector retrieval (top-k resume chunks)	20–80ms	Pinecone / FAISS / Chroma
LLM prompt construction	~10ms	Template fill + context injection
First LLM token (streaming)	200–400ms	GPT-4o / Claude 3.5 Sonnet / Llama
Total to first visible token	580ms–1140ms	Within natural thinking pause window
Complete answer rendered	2–4 seconds	Streaming fills the rest progressively
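
The "total to first visible token" row is the sum of the per-stage ranges, which can be checked directly. The snippet below copies the figures from the table; the dictionary layout and the function name total_to_first_token are only for illustration.

```python
# (best case ms, worst case ms) per stage, copied from the table above
BUDGET_MS = {
    "audio capture + chunking": (0, 0),
    "transcription": (300, 600),
    "question classification": (50, 50),
    "vector retrieval": (20, 80),
    "prompt construction": (10, 10),
    "first LLM token": (200, 400),
}


def total_to_first_token(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


print(total_to_first_token(BUDGET_MS))  # (580, 1140)
```

Note that the table counts capture and chunking as roughly zero. Any time spent waiting for a 1–3 second audio chunk to fill, or for a silence threshold to confirm the question has ended, would sit on top of these totals.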

Engineering Challenges

Why is building a real-time AI interview copilot technically hard?

Latency

Every millisecond across six pipeline stages compounds. Keeping total end-to-end latency under 3 seconds requires co-locating servers with the transcription service, low-latency LLM routing, and client-side streaming rendering. Most consumer-grade implementations sit at 4–7 seconds, which is too slow to be useful in most interviews.

Speaker diarization

Reliably determining who is speaking, the interviewer or the candidate, is hard once both sources have been mixed into one stream and acoustic conditions vary (phone audio, compressed Zoom audio, background noise). High-accuracy diarization in this setting remains an open problem. Misattributed turns put the wrong speaker's words in the transcript and feed garbage input to the classifier.

Question boundary detection

Knowing when the interviewer has finished their question (and not just paused mid-sentence) requires VAD with sufficient silence duration thresholds, language model context, and phrasing heuristics. Triggering generation on a half-formed question wastes latency budget and generates irrelevant suggestions.
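
A simple way to see what a silence duration threshold measures is an energy-based detector. The sketch below is a crude stand-in for a trained VAD model: it assumes mono float samples at 16kHz, 20ms frames, and an RMS threshold of 0.01, all of which are illustrative values, and the function names are hypothetical.

```python
import math


def frame_rms(frame: list[float]) -> float:
    """Root-mean-square energy of one frame; 0.0 for an empty frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def trailing_silence_ms(samples: list[float], sample_rate: int = 16_000,
                        frame_ms: int = 20, threshold: float = 0.01) -> int:
    """Duration of the run of low-energy frames at the end of the signal, in ms."""
    frame_len = sample_rate * frame_ms // 1000
    # Only whole frames are considered; a trailing partial frame is ignored.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    silent = 0
    for frame in reversed(frames):
        if frame_rms(frame) >= threshold:
            break
        silent += 1
    return silent * frame_ms
```

A mid-sentence pause typically yields a short trailing silence, so comparing this value against a threshold of several hundred milliseconds is one way to avoid triggering on a half-formed question, at the cost of adding that wait to the latency budget.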

Context window management

A 60-minute interview generates thousands of transcript tokens. Keeping the accumulated conversation history, resume context, and JD in the LLM prompt without exceeding token limits — while maintaining coherence — requires windowed context strategies or summarization compression.
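
One windowed strategy is to reserve space for the fixed context (resume chunks and JD summary) and then keep only the most recent transcript turns that still fit. The sketch below assumes a crude whitespace word count in place of a real tokenizer, and the names count_tokens and fit_history are hypothetical. Summarizing the dropped turns instead of discarding them would be the compression variant.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_history(turns: list[str], fixed_context: str, limit: int) -> list[str]:
    """Keep the newest turns that fit in limit after reserving room for fixed_context."""
    remaining = limit - count_tokens(fixed_context)
    kept: list[str] = []
    for turn in reversed(turns):       # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break                      # stop at the first turn that does not fit
        kept.append(turn)
        remaining -= cost
    return list(reversed(kept))        # restore chronological order
```

Stopping at the first turn that does not fit keeps the retained history contiguous, which tends to read more coherently than a window with gaps.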

Network reliability

The entire pipeline depends on consistent sub-100ms network round-trips to the inference servers. Mobile hotspot, hotel Wi-Fi, and VPN interference all degrade the latency budget significantly. Edge deployments (Cloudflare Workers AI, AWS Lambda@Edge) reduce geographic latency but increase infrastructure complexity.

FAQ

Frequently Asked Questions

How does an AI interview copilot capture audio?

The copilot captures two audio streams in the browser: the candidate's microphone via getUserMedia(), and the system audio, including the interviewer's voice coming from the video call, via getDisplayMedia() with audio:true. The two streams are mixed client-side with the Web Audio API and sent to the transcription service as a single PCM stream, typically 16kHz mono.

How does real-time transcription work in an AI interview copilot?

Transcription uses Whisper-architecture models (OpenAI Whisper, such as Whisper Large v3, or fine-tuned variants) running either server-side or, in newer implementations, client-side via ONNX runtime in the browser. Audio is chunked into 1–3 second segments and sent to the transcription endpoint. The response is typically returned in 300–600ms per chunk and appended to a rolling transcript buffer that the downstream question classifier reads continuously.

How does question classification work?

A lightweight classifier, typically a fine-tuned BERT or DeBERTa model, monitors the rolling transcript for signals that indicate an interview question: interrogative structure, turn-ending silence, and question-type keywords. It classifies questions into categories (behavioral, technical, situational, competency-based, case/framework) and passes the classification to the retrieval layer along with the raw question text. Classification inference runs in under 50ms.

What is retrieval-augmented generation (RAG) in an AI interview copilot?

At session setup, the user's resume and the job description are chunked and embedded into a vector store (Pinecone, Chroma, or an in-memory FAISS index). When a question arrives, the system retrieves the most semantically relevant resume chunks (specific projects, achievements, skills, dates) and injects them into the LLM prompt as context. This is what makes answers specific to the user rather than generic templates. The generation model then produces a STAR-structured answer grounded in retrieved facts.

How is the answer streamed to the screen?

Answer generation uses streaming output (Server-Sent Events or WebSockets), so the first token appears on screen within 300–500ms of the generation call starting. The UI renders tokens as they arrive, giving the candidate a progressive view that fills in over 1–2 seconds. Total end-to-end latency from question end to first visible token is typically 1.5–3 seconds for a well-optimized pipeline.

What is the Document Picture-in-Picture API and how is it used?

The Document Picture-in-Picture API (a WICG specification, available in Chrome 116+) opens a floating, always-on-top browser window with its own document. The copilot renders the AI answer suggestions into this PiP window so they appear on the user's screen; the window is designed for private notes during normal screen-share workflows. It is styled to be compact (typically 400×300px) and can be placed on a second monitor.

What is the total latency budget for an AI interview copilot?

A well-engineered pipeline targets: Audio capture + chunking: 0ms (streaming). Transcription per chunk (1–3s audio): 300–600ms. Question classification: 50ms. Vector retrieval (top-k chunks): 20–80ms. LLM prompt construction: 10ms. First token from LLM (streaming): 200–400ms. Total to first visible token: 580ms–1140ms. Total to complete answer: 2–4 seconds. This fits within a natural "pause to think" window of 3–5 seconds that candidates use before answering.

How does the copilot handle technical and system design questions?

Premium copilots include a vision layer: the user's screen or the interviewer's shared screen is captured via getDisplayMedia(), and frames are periodically sent to a vision model (GPT-4o Vision or similar) that reads code, diagrams, and whiteboard content. For system design questions, the system generates a structured architectural response (components, data flow, trade-offs) rather than a STAR narrative. For live coding, it can suggest algorithmic approaches given the visible problem statement.

Why is building a real-time AI interview copilot technically hard?

The core challenges: (1) Latency: every component in the pipeline adds delay, and keeping total end-to-end latency under 3 seconds requires aggressive optimization at each layer. (2) Speaker diarization: reliably distinguishing who is speaking (interviewer vs. candidate) once both audio sources are mixed into one stream. (3) Question boundary detection: knowing when the interviewer has finished their question and a response is appropriate. (4) Context window management: keeping resume and JD context in the LLM prompt without exceeding token limits across a long interview. (5) Reliability under variable network conditions.

Does OphyAI run transcription and AI on-device or server-side?

OphyAI currently uses server-side inference for both transcription and generation. Audio is streamed to the server over an encrypted WebSocket, transcribed, classified, and the generation call is made to a hosted LLM with the user's pre-indexed resume context. Some processing (audio mixing, chunking, PiP rendering) happens client-side in the browser. On-device inference via WebAssembly/ONNX is an active area of development that would eliminate network-round-trip latency for transcription.

See the pipeline in action — try OphyAI free

Real-time Whisper transcription, RAG from your resume, streaming generation, Whisper Mode overlay. From $29/mo — free credits to start.

