Building Zubro – a voice-first English tutor on Google’s Gemini Live API

What is Zubro?

Zubro - a European bison with headphones and a green scarf

This article was created as part of my entry to the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

Zubro is a voice-first English learning companion. You talk, Zubro listens, responds, corrects your grammar, and adapts to your level – all through real-time voice conversation. No typing, no multiple choice, no textbook exercises. Just a conversation with a patient AI tutor who happens to be a European bison wearing a green scarf.

A quick note on the name. In Polish, żubr (pronounced roughly “zhoobr”) means European bison – a national symbol and the largest land animal in Europe. I took the dot above the ż in the original Polish spelling, moved it to the end of the word, and got zubro. A slightly mangled version of a Polish word, used to teach English. Felt appropriate.


Zubro landing page - Meet Zubro. Your smart partner for practicing English.

The competition

Zubro was built for the Gemini Live Agent Challenge 2026, specifically the Live Agents category. The challenge: build a real-time agent that users can talk to naturally, that handles interruptions, and that runs on Google Cloud. The key technology is Google’s Gemini Live API – a bidirectional, WebSocket-based API that streams audio in both directions.

I had about a month and decided to go all in. What started as “let’s see if I can get a voice conversation working” turned into a full learning platform with six exercise types, a level assessment system, presentation coaching, live image generation, and an evaluation pipeline that runs alongside the conversation without interrupting it.

The architecture in 30 seconds

The browser talks to a Python backend (FastAPI on Cloud Run) over a WebSocket. The backend talks to the Gemini Live API over another WebSocket. The browser never talks to Gemini directly. This is a deliberate choice – it lets the backend control everything: system instructions, tool definitions, evaluation pipelines, session lifecycle, mic control. The frontend is just an audio pipe and a UI layer.

Browser (JS) <--WebSocket--> Python Backend (Cloud Run) <--WebSocket--> Gemini Live API
                                    |
                                    +-- Image Generation API
                                    +-- Evaluation Models
                                    +-- Cloud Firestore
                                    +-- Cloud Storage

The whole thing runs on Google Cloud: Cloud Run for compute, Firestore for user data and progress, Cloud Storage for ephemeral file uploads, Secret Manager for API keys, and Firebase Authentication for sign-in.
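
To make the proxy pattern concrete, here is a minimal sketch of the bridge using FastAPI and the google-genai SDK. This is my simplified illustration, not Zubro’s actual code – the endpoint path, system instruction, and (absent) error handling are placeholders:

import asyncio

from fastapi import FastAPI, WebSocket
from google import genai
from google.genai import types

app = FastAPI()
client = genai.Client()  # picks up API key / Vertex config from the environment

MODEL = "gemini-2.5-flash-native-audio-preview"

@app.websocket("/ws")
async def bridge(browser: WebSocket) -> None:
    await browser.accept()
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        system_instruction="You are Zubro, a patient English tutor.",
    )
    async with client.aio.live.connect(model=MODEL, config=config) as gemini:

        async def browser_to_gemini() -> None:
            # The browser sends 16 kHz PCM chunks; forward them as realtime input.
            while True:
                chunk = await browser.receive_bytes()
                await gemini.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def gemini_to_browser() -> None:
            # Gemini streams 24 kHz PCM back; relay each chunk to the browser.
            async for message in gemini.receive():
                if message.data:
                    await browser.send_bytes(message.data)

        await asyncio.gather(browser_to_gemini(), gemini_to_browser())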

What Zubro can do

Here’s what landed in the final build:

  • Guided onboarding – Zubro walks new users through the UI via voice, highlights status bar elements, and runs a mic test. All scripted from the backend using text injection.
  • Level assessment – four exercises (warmup conversation, free speech monologue, roleplay scenario, describe a generated slide). Produces a detailed report card with scores across grammar, vocabulary, pronunciation, fluency, and comprehension.
  • Free conversation – open-ended practice with live image generation, a personal practice list that persists across sessions, and on-demand feedback (“how am I doing?”).
  • Presentation coach – upload a PDF, practice presenting slide by slide with voice feedback, or discuss each slide with Zubro in a live two-way conversation. Zubro sees the current slide.
  • Roleplay scenarios – realistic situations (job interviews, ordering coffee, travel) with configurable time limits.
  • Session recovery – if the WebSocket dies during an assessment, progress is saved to Firestore. Reload and resume within an hour.

The interesting engineering problems

Zubro as a barista - roleplay scenario

Building a demo that plays a pre-recorded conversation is easy. Building something that actually works in real-time, recovers from failures, and doesn’t feel janky – that’s where it gets fun. Here are the problems I spent the most time on.


Sessions that drop

Gemini Live API WebSocket connections can drop for all kinds of reasons: 1008 errors during tool calls, 1011 internal errors, GoAway messages when the server rotates a connection, plain network hiccups. Each drop kills the conversation.

The solution is a multi-layer recovery system in my LiveSession class. GoAway messages trigger a transparent reconnect using session resumption handles. Unexpected crashes (1008/1011) also trigger a reconnect, which additionally replays the last user utterance that was lost – so the conversation picks up naturally instead of going silent. A time-based retry budget prevents infinite reconnect loops while staying generous with one-off failures.

For the level assessment (which can take 10+ minutes), I also save progress to Firestore after each exercise. If everything fails and the session can’t recover, the user can reload the page and resume from where they left off.
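
For illustration, here is roughly what resumption-aware reconnect logic looks like with the google-genai SDK. A simplified sketch, not the real LiveSession – the budget value and the utterance-stashing details are placeholders:

import time

from google import genai
from google.genai import types

client = genai.Client()
MODEL = "gemini-2.5-flash-native-audio-preview"
RETRY_BUDGET_SECONDS = 60   # time-based budget instead of a fixed retry count

async def run_session() -> None:
    handle = None           # latest resumption handle sent by the server
    lost_utterance = None   # last user audio chunk to replay after a crash
    deadline = time.monotonic() + RETRY_BUDGET_SECONDS

    while time.monotonic() < deadline:
        config = types.LiveConnectConfig(
            response_modalities=["AUDIO"],
            # Passing the previous handle resumes the conversation context.
            session_resumption=types.SessionResumption(handle=handle),
        )
        try:
            async with client.aio.live.connect(model=MODEL, config=config) as session:
                if lost_utterance:   # replay what the crash swallowed
                    await session.send_realtime_input(audio=types.Blob(
                        data=lost_utterance, mime_type="audio/pcm;rate=16000"))
                    lost_utterance = None
                async for message in session.receive():
                    update = message.session_resumption_update
                    if update and update.resumable:
                        handle = update.new_handle   # keep the freshest handle
                    if message.go_away:
                        break    # server rotation: loop around and reconnect
                    # ... normal audio/transcript handling goes here ...
                else:
                    return       # stream ended normally: nothing to recover
        except Exception:
            # Unexpected 1008/1011 crash: the real code stashes the last user
            # utterance here, then reconnects with the saved handle.
            continue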

Injecting context without interrupting

This one was surprising. When you send text into a Live API session via send_realtime_input(text=...), it’s treated as user input. If the model is currently speaking, your injection interrupts it – same as if the user started talking over it.

The fix: wait for the model to finish its current turn before injecting. I built a wait_for_turn_complete() method that listens for the turn_complete event from the recv loop. The pattern becomes: wait for turn complete, then inject text. Simple, but it unlocks a lot.

This is used for the scripted onboarding (each prompt waits for Zubro to finish before sending the next one) and for the graceful goodbye at end of sessions. But the bigger insight is that this pattern can handle any external process feeding information into a live session – background model calls, scheduled tasks, results from other services – all without interrupting the conversation. It essentially gives you full control over when and how the model receives new information.
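
A minimal sketch of the pattern, assuming a recv loop that owns the session (the class and method names here are my simplification, not Zubro’s actual code):

import asyncio

class TurnAwareSession:
    def __init__(self, session):
        self.session = session
        self._turn_complete = asyncio.Event()

    async def recv_loop(self) -> None:
        async for message in self.session.receive():
            content = message.server_content
            if content and content.turn_complete:
                self._turn_complete.set()    # the model just finished a turn
            # ... audio playback, transcription handling, etc. ...

    async def wait_for_turn_complete(self) -> None:
        # The real code also handles the "model isn't speaking at all" case;
        # this sketch assumes a turn is in progress.
        await self._turn_complete.wait()
        self._turn_complete.clear()

    async def inject_text(self, text: str) -> None:
        # Wait for quiet, then feed text in as if the user had said it.
        await self.wait_for_turn_complete()
        await self.session.send_realtime_input(text=text)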

Ending sessions gracefully

Every exercise has a time limit. When it fires, you can’t just cut the WebSocket – the user might be mid-sentence, or Zubro might be mid-response. A sudden disconnect feels broken.

I implemented a three-way decision tree based on who is speaking when the timer fires:

  1. Zubro is speaking – mute the user’s mic immediately, wait for Zubro to finish, then inject a goodbye.
  2. User is speaking – let them finish. When Zubro starts responding, mute the mic at that point. After Zubro’s response, inject goodbye.
  3. Silence – wait 3 seconds (the user might have just started talking but the transcription hasn’t arrived yet). If activity appears, handle it as case 1 or 2. Otherwise, proceed with the goodbye.

Because the Live API model has full conversation context, Zubro’s goodbye is natural – it references what was just discussed, not a canned “time’s up” message.
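
In sketch form (the session state flags and helper methods here are hypothetical, but the control flow mirrors the three cases above):

import asyncio

async def end_gracefully(session) -> None:
    if session.model_speaking:                # case 1: Zubro is mid-response
        session.mute_mic()
        await session.wait_for_turn_complete()
    elif session.user_speaking:               # case 2: let the user finish
        await session.wait_for_model_response_start()
        session.mute_mic()                    # mute once Zubro starts replying
        await session.wait_for_turn_complete()
    else:                                     # case 3: apparent silence
        await asyncio.sleep(3)                # a transcription may be in flight
        if session.model_speaking or session.user_speaking:
            return await end_gracefully(session)   # re-dispatch as case 1 or 2
    # Full conversation context makes the farewell specific, not canned.
    await session.inject_text(
        "Time is up. Say a short, warm goodbye that references "
        "what we just talked about.")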

Evaluation without disruption

Here’s a problem I didn’t anticipate: you can’t ask the Live API model to be both a conversation partner AND a language evaluator. I tried – early experiments showed that when you ask the model to simultaneously maintain a fun conversation and produce precise grammar evaluations, both suffer. The scores were unreliable and the conversation felt stilted.

The solution: separate them entirely. Two independent evaluation pipelines run alongside the conversation, each with its own dedicated Gemini model:

  • OngoingEvaluator – fires after each user utterance, sends the audio chunk to Gemini Flash Lite, produces a running score. When the user asks “how am I doing?”, Zubro reads the latest result.
  • SessionEvaluator – collects all audio throughout the session, runs a comprehensive evaluation after the session ends, produces the detailed report card.

The Live API model does what it’s best at – being a natural, engaging conversation partner. Evaluation is offloaded to a specialist. Neither blocks the other.
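
The ongoing evaluator boils down to a fire-and-forget background task that makes a standard, non-live generate_content call. A hedged sketch (the prompt, mime type, and function name are placeholders, not Zubro’s actual code):

import asyncio

from google import genai
from google.genai import types

client = genai.Client()

async def evaluate_utterance(audio_chunk: bytes) -> str:
    # A standard (non-live) request to a cheap model; run as a background
    # task so it never blocks the live conversation.
    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=[
            types.Part.from_bytes(data=audio_chunk, mime_type="audio/wav"),
            "Score this English utterance for grammar, vocabulary and "
            "fluency. Reply as short JSON.",
        ],
    )
    return response.text

# Inside the recv loop, after each user utterance:
#     asyncio.create_task(evaluate_utterance(chunk))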

The Firefox echo cancellation discovery

This one took a while to track down. The Gemini Live API sends audio at 24kHz. My initial approach was to set the AudioContext sample rate to 24kHz to match – seems logical, right? It worked fine on Chrome.

On Firefox, Zubro kept interrupting itself. It would start speaking, then detect its own voice as user input and stop. The echo cancellation was completely broken.

Turns out, when you set a non-standard AudioContext sample rate in Firefox, the browser’s built-in echo cancellation doesn’t work properly. The fix: leave the AudioContext at its default rate (48kHz) and let the browser handle resampling internally. Chrome is more forgiving; Firefox is not. This is the kind of thing you only find through testing on multiple browsers with actual audio output through speakers (headphones mask the issue because there’s no echo to cancel).

The audio pipeline

Zubro at a presentation podium

No third-party audio libraries. The entire capture and playback pipeline is vanilla Web Audio API:

  • Capture: an AudioWorklet that captures at 48kHz and downsamples to 16kHz PCM in a separate thread. The main UI never blocks.
  • Playback: each incoming audio chunk from Gemini becomes its own AudioBufferSourceNode, scheduled at a precise time offset. The browser’s internal audio scheduler handles gap-free playback. Interruption is trivial – just reset the next play time to audioContext.currentTime.

This approach (the playback part is borrowed from Google’s own reference implementation) eliminated the jitter and stuttering I saw with simpler approaches.

What I learned about the Live API

A few observations from a month of building on top of the Gemini Live API:

  • Session resumption is essential, not optional. Connections will drop. Build resumption into your session management from day one, not as an afterthought.
  • Tool calls are fragile. The Live API’s function calling can trigger intermittent 1008/1011 WebSocket crashes – a known Gemini-side issue. My workaround is retry + resumption, but I also designed a pattern that offloads heavy tool work entirely outside the Live API session – return a filler response instantly, run the real work via a standard API call, inject the result back non-disruptively (sketched after this list).
  • The model is surprisingly good as a persona. With minimal system instructions (literally two sentences for the tutor persona), Zubro is consistent, patient, and engaging. I expected to need heavy prompting – I didn’t.
  • Audio in, audio out – keep it simple. The native audio model handles voice naturally. Don’t overthink the conversation side. Spend your complexity budget on what happens around the conversation – evaluation, state management, recovery.
  • Media injection just works. Sending images into the live session is seamless. The model sees and references them immediately. Text injection is the tricky one (it interrupts), but media doesn’t.
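
Here is the offload pattern from the tool-call bullet above, sketched with hypothetical helpers (do_heavy_work and inject_text stand in for the real logic):

import asyncio

from google.genai import types

async def handle_tool_call(live, tool_call) -> None:
    for fc in tool_call.function_calls:
        # 1. Answer instantly with a filler, so the live session spends as
        #    little time as possible inside the fragile function-call path.
        await live.session.send_tool_response(function_responses=[
            types.FunctionResponse(id=fc.id, name=fc.name,
                                   response={"status": "working on it"})])
        # 2. Run the real work off-session as a background task.
        asyncio.create_task(run_and_inject(live, fc))

async def run_and_inject(live, fc) -> None:
    result = await do_heavy_work(fc.args)    # e.g. a standard image-gen call
    # 3. Hand the result back using the non-disruptive injection pattern.
    await live.inject_text(f"(Background task finished: {result})")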

The tech stack

  • Backend: Python, FastAPI, Google GenAI SDK (google-genai)
  • Frontend: Vanilla JavaScript, Web Audio API
  • Auth: Firebase Authentication (Google sign-in)
  • Database: Cloud Firestore
  • Hosting: Google Cloud Run (non-root container)
  • Other Google Cloud: Secret Manager, Cloud Storage, Cloud Build, Artifact Registry, IAM with least-privilege service account
  • Models: seven Gemini models in total, including gemini-2.5-flash-native-audio-preview for live voice, gemini-2.5-flash-lite and gemini-3.1-flash-lite-preview for evaluation, and gemini-2.5-flash-image and gemini-3.1-flash-image-preview for image generation

Try it

Zubro is live at v1.zubro.eu. Sign in with Google, give it microphone access, and start talking. The deployment scales down to zero when idle, so the first load might take a few extra seconds.

If you’re building on the Gemini Live API and run into the same problems I did – session drops, tool call crashes, audio pipeline quirks – I hope some of these solutions save you time. It’s a powerful API once you learn to work with its edges.


Built for the Gemini Live Agent Challenge 2026 (Live Agents category). #GeminiLiveAgentChallenge
