
Realtime AI Patient Conversation

Sub-second, full-duplex voice that turns sessions with simulated patients into believable clinical encounters.

SimTutor’s simulation training platform lets nursing and medical students practise clinical encounters with AI patients. The experience hinges on one thing: does the patient feel real?

Until now, the answer was “mostly.” The AI was competent — it followed case scenarios, responded appropriately, and assessed learner performance. But every exchange had a 2–3 second gap between the learner finishing a sentence and the patient responding. That gap breaks immersion in a way that’s hard to recover from.

The problem was the conversation model. The existing system processed each message in sequence: the learner clicked to record, clicked again to send, then waited while the system transcribed their audio to text, ran it through the language model, and converted the response back to speech. Every step added delay. Worse, the system was uninterruptible — learners had to wait for the AI to finish speaking before they could respond, even if the patient was heading in the wrong clinical direction.

Real clinical conversations don’t work like that.

From turn-based to real-time

We replaced the sequential pipeline with a native Speech-to-Speech model that processes audio directly, without converting to text in between. The AI hears the learner and speaks back in real time, the same way a person would.

The connection uses WebRTC — the same technology behind video calls — for bi-directional audio streaming. This handles the network complexity that clinical environments demand, including institutional firewalls and inconsistent Wi-Fi, while keeping latency low.
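At a high level, that connection looks like a standard WebRTC handshake: stream the microphone up, play the model's audio as it arrives, and exchange SDP over HTTPS. The sketch below illustrates the pattern; the endpoint URL and header names are assumptions, not SimTutor's actual integration.

```typescript
// Illustrative WebRTC session with a realtime speech model.
// REALTIME_URL is an assumed endpoint, not a confirmed production value.
const REALTIME_URL = "https://api.openai.com/v1/realtime";

// Build the headers for the SDP offer exchange (pure, testable).
function offerHeaders(ephemeralToken: string): Record<string, string> {
  return {
    Authorization: `Bearer ${ephemeralToken}`,
    "Content-Type": "application/sdp",
  };
}

async function connectRealtime(ephemeralToken: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Upstream audio: the learner's microphone.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // Downstream audio: the AI patient's voice, played as it streams in.
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // Exchange SDP: post our offer, apply the model's answer.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(REALTIME_URL, {
    method: "POST",
    headers: offerHeaders(ephemeralToken),
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return pc;
}
```

Because audio flows over the peer connection rather than request/response HTTP, both sides can speak and listen at the same time — which is what makes interruption handling possible at all.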

Session security uses short-lived tokens generated on demand. No long-lived credentials are stored in the browser.
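In practice this means the browser asks a backend endpoint for a token immediately before connecting, and refuses to reuse one that is near expiry. A minimal sketch, assuming a hypothetical `/api/realtime-token` route and token shape:

```typescript
// Illustrative ephemeral-token flow. Route name and payload shape are
// assumptions; the real backend holds the provider API key server-side.
interface EphemeralToken {
  value: string;
  expiresAt: number; // Unix ms
}

// Pure guard: refuse to reuse a token that is expired or about to expire.
function isUsable(token: EphemeralToken, nowMs: number, marginMs = 5_000): boolean {
  return token.expiresAt - nowMs > marginMs;
}

async function fetchEphemeralToken(): Promise<EphemeralToken> {
  // The backend mints a token valid for a short window (e.g. ~60s).
  const resp = await fetch("/api/realtime-token", { method: "POST" });
  if (!resp.ok) throw new Error(`Token request failed: ${resp.status}`);
  return (await resp.json()) as EphemeralToken;
}
```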

Natural interruptions

Low latency alone isn’t enough. The harder problem is making conversations feel natural — and that means handling interruptions correctly.

The system’s silence detection is tuned specifically for clinical assessment. Nurses and medical students need time to think between questions — a moment to consider medication history or formulate a follow-up — without the AI jumping in prematurely.
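Concretely, this tuning lives in the voice-activity-detection settings for the session. The values below are illustrative assumptions, not SimTutor's production numbers — the point is that the silence window is stretched well past conversational defaults:

```typescript
// Illustrative server-VAD configuration for clinical pacing.
// All values are assumptions chosen to show the shape of the tuning.
function clinicalTurnDetection(silenceMs = 1200) {
  return {
    type: "server_vad",
    threshold: 0.6,                 // speech-probability cutoff; higher rejects more noise
    prefix_padding_ms: 300,         // audio retained from just before speech is detected
    silence_duration_ms: silenceMs, // pause length before the learner's turn is considered over
  };
}
```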

When a learner does start speaking while the AI patient is mid-sentence, the AI stops immediately. The transcript updates to show only what the patient actually said before being interrupted — no phantom sentences that the learner never heard.
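Getting the transcript right requires knowing how much of the response had actually played when the barge-in happened. One simple way to approximate the heard portion is a proportional word-count cut — an illustrative heuristic, not necessarily SimTutor's exact method:

```typescript
// Trim an interrupted response to roughly what the learner actually heard,
// based on how much of the audio had played. Proportional word-count is an
// illustrative heuristic for estimating the heard portion.
function truncateTranscript(fullText: string, playedMs: number, totalMs: number): string {
  if (totalMs <= 0 || playedMs >= totalMs) return fullText;
  const words = fullText.split(/\s+/).filter(Boolean);
  const keep = Math.floor(words.length * (playedMs / totalMs));
  return words.slice(0, keep).join(" ");
}
```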

We also built filtering to handle background noise in simulation labs. Other students, equipment, and HVAC systems frequently produce sounds that speech models misinterpret as words. The system rejects these false inputs so the conversation stays on track.
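A first line of defence against lab noise can be as simple as an energy gate on incoming audio frames: low-energy rumble and clatter never reach the model. The threshold below is an illustrative assumption:

```typescript
// Minimal energy gate: drop frames whose RMS energy is below a threshold,
// so HVAC rumble and distant chatter don't register as speech.
// The threshold value is illustrative.
function isLikelySpeech(samples: Float32Array, rmsThreshold = 0.02): boolean {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  const rms = Math.sqrt(sum / samples.length);
  return rms >= rmsThreshold;
}
```

In a real pipeline this sits in front of the model's own voice-activity detection rather than replacing it — the gate is cheap and local, while the VAD makes the final call.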

Session management

Real-time AI connections are billed by connection time, so cost management matters. If no message is sent for 120 seconds, the session ends automatically. This threshold was chosen deliberately — two minutes without a conversational exchange is well beyond the natural rhythm of a clinical encounter, and tying the timeout to messages rather than audio prevents background noise from keeping expensive sessions alive indefinitely.
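The mechanics are a timer that restarts on every message and deliberately ignores raw audio. A minimal sketch of that logic:

```typescript
// Inactivity timeout tied to messages, not audio: ambient noise can't keep
// a billed session alive. 120s matches the behaviour described above.
const IDLE_LIMIT_MS = 120_000;

class IdleTimer {
  private lastMessageAt: number;

  constructor(nowMs: number) {
    this.lastMessageAt = nowMs;
  }

  // Called for each conversational exchange; audio frames do NOT call this.
  onMessage(nowMs: number): void {
    this.lastMessageAt = nowMs;
  }

  shouldEnd(nowMs: number): boolean {
    return nowMs - this.lastMessageAt >= IDLE_LIMIT_MS;
  }
}
```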

When a learner resumes, the full conversation history is restored into the new session. The patient remembers everything previously discussed. From the learner’s perspective, the conversation picks up exactly where it left off.
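Restoring context means replaying each prior turn into the fresh session before audio resumes. The event shape below follows the general realtime-API pattern; the exact field names are assumptions about the integration:

```typescript
// Replay stored conversation history into a new realtime session.
// Event and field names follow the common realtime-API pattern and are
// assumptions, not confirmed details of SimTutor's integration.
interface Turn {
  role: "user" | "assistant";
  text: string;
}

function historyToEvents(history: Turn[]) {
  return history.map((turn) => ({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: turn.role,
      content: [
        {
          // User turns are re-injected as input text; assistant turns as output text.
          type: turn.role === "user" ? "input_text" : "text",
          text: turn.text,
        },
      ],
    },
  }));
}
```

Sending these before the first audio frame means the patient "remembers" the whole encounter even though the underlying connection is brand new.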

Immersive interface

The visual layer reinforces the sense of presence. The patient avatar is surrounded by a glow that pulses with the AI’s speech — a continuous, reactive visualisation that makes the patient feel like they’re actually speaking.

The learner’s side shows a waveform meter that responds when the microphone is active, giving clear visual feedback about what the system is hearing.
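Both visuals can be driven the same way: an `AnalyserNode` from the Web Audio API samples the relevant stream, and a normalised level drives the glow or the meter each animation frame. A sketch, with element and parameter choices that are illustrative rather than SimTutor's actual code:

```typescript
// Drive a glow element from an audio stream's level via the Web Audio API.
// The same pattern on the microphone stream drives the waveform meter.
// fftSize, opacity mapping, and element wiring are illustrative choices.

// Pure helper: average deviation from the byte-domain midpoint → 0..1 level.
function levelFromBytes(data: Uint8Array): number {
  let sum = 0;
  for (let i = 0; i < data.length; i++) sum += Math.abs(data[i] - 128); // 128 = silence
  return Math.min(1, sum / data.length / 128);
}

function attachGlow(ctx: AudioContext, stream: MediaStream, glow: HTMLElement): void {
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 256;
  ctx.createMediaStreamSource(stream).connect(analyser);
  const data = new Uint8Array(analyser.fftSize);

  const tick = () => {
    analyser.getByteTimeDomainData(data);
    glow.style.opacity = String(0.3 + 0.7 * levelFromBytes(data));
    requestAnimationFrame(tick);
  };
  tick();
}
```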

Full-screen background images create a clinical atmosphere — an exam room, a ward, a consultation space — that grounds the encounter in a specific context.

The platform includes a library of distinct AI voices, allowing simulation authors to match patient demographics to the clinical scenario.

What it means

The result is a simulation step where the AI patient responds in under a second, can be interrupted mid-sentence, and maintains context across session boundaries. For nursing and medical students, this is the difference between practising a clinical encounter and practising with a chatbot.

The approach — real-time audio streaming, natural interruption handling, reactive visual feedback — is applicable well beyond healthcare simulation. Any domain where AI voice interaction needs to feel like a conversation rather than a command interface faces the same set of problems. This is one way to solve them.

Pattern partners with SimTutor on product strategy, engineering, and cloud infrastructure. This feature is available now on the SimTutor platform.

Frequently Asked Questions

What is the Realtime AI Conversation feature?
A full-duplex, native Speech-to-Speech simulation step that replaces the legacy sequential STT-to-TTS pipeline, enabling sub-400ms AI patient responses with natural interruption handling.
What technologies power the feature?
OpenAI’s gpt-realtime-1.5 model over WebRTC for bi-directional audio, Web Audio API for reactive visualisations, and Azure Functions for ephemeral session authentication.

Have a similar challenge?

Tell us about your project. We'll tell you honestly how we can help.

Get in Touch