
Ballsy Voice Assistant

A deployed side project where I used FastAPI, WebSockets, and Gemini 2.0 Flash to make a browser-based voice assistant feel responsive, with per-user sessions, rate limits, and a Cloud Run deployment.

Role: Full-Stack Engineer (solo)
Timeline: Jan 2025 — May 2025
Status: Live demo available
Configured 1,000-session cap
30 req/min token bucket
Cloud Run deployment
The Hook

Most demo voice assistants break the moment a second user connects. No session isolation, no rate limiting, no graceful degradation — just a Siri lookalike that falls over under load.

Business Use Case

I used Ballsy to learn the parts most voice-assistant demos skip: session isolation, rate limiting, reconnect behavior, and deploying the whole thing somewhere real. The main engineering goal was making sure two users did not share state and one user could not spam the backend.

Problem

I wanted to push the project past a single-user demo. That meant per-session state, per-user rate limits, WebSocket reconnect behavior, and a deployment setup that did not depend on my laptop. Gemini 2.0 Flash made the answers fast enough for a demo, but the surrounding app still had to handle all the normal web-app messiness.

Approach

FastAPI handles the WebSocket connection and REST fallbacks. Each client gets a session, and the command layer decides whether the request is a normal chat, web search, maps, media command, or calculation before sending anything to Gemini. I used Cloud Run, Cloud SQL, Secret Manager, and Terraform so I could deploy it repeatedly without rebuilding the setup by hand.
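
A minimal sketch of that shape, assuming illustrative route paths and a placeholder process() helper (the real command layer and Gemini call sit behind it):

```python
# Minimal sketch of the gateway described above: one shared handler
# behind both the WebSocket endpoint and a REST fallback. Route paths
# and process() are illustrative, not the project's actual API surface.
import uuid

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    text: str

async def process(session_id: str, text: str) -> str:
    # Stand-in for: load session -> classify command -> maybe call Gemini.
    return f"[{session_id[:8]}] reply to: {text}"

@app.post("/chat")                         # REST fallback
async def chat(req: ChatRequest) -> dict:
    return {"reply": await process(req.session_id, req.text)}

@app.websocket("/ws")                      # primary realtime path
async def voice_ws(ws: WebSocket) -> None:
    await ws.accept()
    session_id = str(uuid.uuid4())         # one isolated session per socket
    try:
        while True:
            text = await ws.receive_text()
            await ws.send_text(await process(session_id, text))
    except WebSocketDisconnect:
        pass
```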

Architecture
  1. Browser client — Web Audio API for capture/playback, animated orb UI tied to listening/responding state, and a WebSocket client with reconnect logic.
  2. FastAPI gateway — WebSocket endpoint with heartbeat health checks, connection pooling, input validation, CORS + trusted-host middleware.
  3. Orchestration layer — intent classifier routes to command handlers (web search, maps, YouTube, calculator) before the LLM sees the message; see the routing sketch after this list.
  4. Gemini 2.0 Flash — used for the assistant response after routing/command handling.
  5. Storage — Cloud SQL PostgreSQL for sessions/history; rate limiting implemented as a token bucket.
  6. Infra — Cloud Run, Secret Manager, Docker, and Terraform-managed GCP resources.
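
The orchestration step in item 3 boils down to "deterministic handlers first, LLM last". A hedged sketch, with classify(), the handler registry, and ask_gemini() as stand-in names for the project's real components:

```python
# Handlers-before-LLM routing, as in item 3. classify(), the registry,
# and ask_gemini() are stand-ins, not the project's real identifiers.
import asyncio
from typing import Awaitable, Callable

HANDLERS: dict[str, Callable[[str], Awaitable[str]]] = {}

def handler(intent: str):
    """Register a deterministic command handler for one intent."""
    def register(fn: Callable[[str], Awaitable[str]]):
        HANDLERS[intent] = fn
        return fn
    return register

@handler("calculator")
async def calculate(text: str) -> str:
    # Toy handler: sums the numbers in a "calc 2 + 3" style request.
    nums = text.removeprefix("calc").replace("+", " ").split()
    return str(sum(float(n) for n in nums))

def classify(text: str) -> str:
    # Stand-in classifier; the real one also covers search/maps/media.
    return "calculator" if text.startswith("calc") else "chat"

async def ask_gemini(text: str) -> str:
    return f"(LLM reply to: {text})"       # placeholder for the Gemini call

async def route(text: str) -> str:
    fn = HANDLERS.get(classify(text))
    return await fn(text) if fn else await ask_gemini(text)

print(asyncio.run(route("calc 2 + 3")))    # -> 5.0
```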
Challenges & Decisions

Mobile WebSocket flakiness

Mobile browsers would drop the connection when the tab was backgrounded or the network changed. I added reconnect logic and session resume instead of treating every reconnect as a brand-new user, but mobile browser audio is still one of the fragile parts.
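
Server-side, the resume decision is small. A sketch assuming the session id travels back as a query parameter; the in-memory store here is an assumption, since the real app keeps sessions in Cloud SQL:

```python
# Sketch of resume-on-reconnect: a known session id restores state
# instead of minting a new session. The dict store and field names
# are assumptions; the real app persists sessions in Cloud SQL.
import uuid

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
sessions: dict[str, dict] = {}

@app.websocket("/ws")
async def connect(ws: WebSocket, session_id: str | None = None) -> None:
    await ws.accept()
    if session_id and session_id in sessions:
        state = sessions[session_id]       # resumed: history intact
    else:
        session_id = str(uuid.uuid4())     # genuinely new client
        sessions[session_id] = state = {"history": []}
    await ws.send_json({"type": "session", "id": session_id})
    try:
        while True:
            state["history"].append(await ws.receive_text())
    except WebSocketDisconnect:
        pass  # keep the session so the next connect can resume it
```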

Audio without gaps

The first version felt choppy because audio and UI state were not synchronized. I added buffering around playback and made the orb states reflect listening/thinking/speaking, but this is still browser-dependent and not as smooth as a native app.
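
One server-side piece of that synchronization, as a sketch: tag each phase with an explicit state event and send audio in chunks the client can buffer ahead, so the orb follows the protocol instead of guessing from audio timing. The event names and chunk size are assumptions, not the app's actual protocol:

```python
# Hedged sketch: explicit state events around each reply so the orb can
# switch listening/thinking/speaking deterministically, plus chunked
# audio for client-side buffering. Event names are assumptions.
import base64

async def generate_reply(session_id: str, text: str) -> tuple[str, bytes]:
    return f"reply to {text}", b"\x00" * 100_000   # placeholder LLM + TTS

async def respond(ws, session_id: str, text: str) -> None:
    await ws.send_json({"type": "state", "value": "thinking"})
    reply, audio = await generate_reply(session_id, text)
    await ws.send_json({"type": "state", "value": "speaking", "text": reply})
    for i in range(0, len(audio), 32_768):         # fixed-size chunks
        chunk = base64.b64encode(audio[i : i + 32_768]).decode()
        await ws.send_json({"type": "audio", "data": chunk})
    await ws.send_json({"type": "state", "value": "listening"})
```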

Atomic rate limiting at scale

I did not want one user to hammer the command endpoint, so I added a token-bucket style limit at 30 requests per minute. The important lesson was less the number and more making the limit part of the backend, not just a disabled button in the UI.
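
The bucket itself is a few lines; the sketch below shows the math at the configured 30 req/min. The heading's "atomic" caveat matters here: a per-process dict like this is not atomic across Cloud Run instances, which is the usual reason the bucket ends up in Redis (listed in the stack below):

```python
# Token bucket at the configured 30 req/min. A per-process dict is not
# atomic across Cloud Run instances; Redis is the usual home for that.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float = 30 / 60            # refill: 30 tokens per 60 s
    capacity: float = 30.0           # burst ceiling
    tokens: float = 30.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # caller rejects: 429 or dropped message

buckets: dict[str, TokenBucket] = {}

def allow_request(session_id: str) -> bool:
    return buckets.setdefault(session_id, TokenBucket()).allow()
```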

Cold-start latency on Cloud Run

Cloud Run cold starts made the first request feel slower than local development. I tuned the deployment and connection setup, but I would still describe this as a deployed side project, not a latency-hardened production assistant.
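
One of the tuning levers, sketched under assumptions: move expensive setup (the Cloud SQL pool) into the app's lifespan so a cold container pays that cost once at start rather than on the first user request; Cloud Run's min-instances setting is the other common lever. The connection URL below is a placeholder:

```python
# Hedged cold-start mitigation sketch: warm the Cloud SQL pool once at
# container start instead of on the first request. URL is a placeholder.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine("postgresql+asyncpg://user:pass@host/db")

@asynccontextmanager
async def lifespan(app: FastAPI):
    async with engine.connect():     # eager pool checkout on startup
        pass
    yield
    await engine.dispose()

app = FastAPI(lifespan=lifespan)
```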

Results
  • Live deployment on Cloud Run with auto-scaling.
  • Configured session cap and 30 req/min/user rate limit to keep the demo from being completely open-ended.
  • Session isolation worked in my concurrency testing; I need to add the exact test size before making a stronger claim.
  • Terraform covers the main GCP resources so I can recreate the deployment without clicking through the console.
What I'd Change
  • This is a deployed side project, not a polished assistant product. It still depends heavily on browser speech/audio behavior.
  • If I rebuilt it, I would separate the realtime transport layer from the LLM command layer earlier.
  • Before claiming serious concurrency, I would add repeatable load tests with exact user counts and latency numbers.
Stack

Backend

FastAPI, WebSockets, Python 3.11, Jinja2, SQLAlchemy + Alembic

AI

Gemini 2.0 Flash, Google Cloud TTS, SpeechRecognition (Google STT)

Frontend

JavaScript, Web Audio API, HTML/CSS

Infra

Cloud Run, Cloud SQL PostgreSQL, Redis, Secret Manager, Docker, Terraform

Quality

Playwright