Departed Spring 2025
Gemi
my room · AI agent · Shipped
[image: iPad mounted on the bedroom wall acting as a mirror — live camera feed of the room, Siri-style multicolored gradient border pulsing around the screen edge, semi-transparent dark debug terminal overlaid in the bottom-left corner streaming status logs]
Departure
Google Home Mini used to flip the lights when I asked. Then the family stopped paying for the subscription and the bridge to our Universal Devices ISY went silent — the lights remembered who used to ask but couldn't hear me anymore. I wanted the room back, but on better terms: no wake word, always listening, picking up commands the way a person in the room would. Lights, music, a clean iPad on the wall acting like a mirror. Build the assistant I wished Google had become.
Approach
- SwiftUI
- Gemini Live
- Gemini 2.0 Flash Thinking
- Llama 3.2:3b
- Faster-Whisper
- Orpheus TTS
- Spotipy
- ISY REST
- Random Forest
No wake word — voice activation has to come from the room, and the loop has to be fast enough to feel like conversation.
Field log
Spring 2025 — opening
Google Home Mini used to flip the lights when I asked. Family stopped paying for the subscription, the bridge to our Universal Devices ISY went silent, and the room forgot how to listen.
[image: Problem diagram — line drawing of a Google Home Mini on the left connected to a router with a large red X over the line, router connected to a lightbulb on the right, two clouds floating above and below]
The link that died.
What I want
No wake word. Automatic recognition: is the room being spoken to, and how many commands did it just hear. Lights, music, and a clean iPad on the wall as the surface.
[image: Requirements slide — white outline of an iPad with four black icons inside the screen: a microphone with a slash through it, a lightbulb, a gear, and a musical note]
The wizard diagram
Streaming mic into a speech-to-text model, into an LLM that splits two ways — function calls in one direction, text-to-speech back to the user in the other. Magic, but plumbed.
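A skeleton of that loop, as a sketch: each stage below is a placeholder for the component named in the stack above, so only the shape of the data flow is meant literally.

```python
def stt(audio: bytes) -> str:
    # placeholder: Faster-Whisper goes here
    return "turn off my lamp"

def llm(text: str) -> tuple[str, list[dict]]:
    # placeholder: the LLM returns a spoken reply plus function calls
    return "yeah, on it", [{"name": "set_light", "args": {"node": "lamp", "on": False}}]

def dispatch(call: dict) -> None:
    # placeholder: ISY / Spotipy handlers go here
    print("would execute:", call)

def speak(text: str) -> None:
    # placeholder: Orpheus TTS goes here
    print("would say:", text)

def handle_utterance(audio: bytes) -> None:
    text = stt(audio)         # streaming mic in
    reply, calls = llm(text)  # the split: one LLM, two directions
    for call in calls:
        dispatch(call)        # function calls toward the devices
    speak(reply)              # speech back toward the user

handle_utterance(b"")
```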
[image: Architecture flowchart — microphone icon labeled 'Streaming mic' arrows into an STT box, into an LLM box, then splits into a 'Function calls' box with gears above and a 'TTS' box with a speaker below, with a wizard illustration in red robes casting a spell over a crystal ball at the bottom]
Mic in, wizard out.
Lights
The Universal Devices ISY exposes a REST API — get and set against an XML node tree. Could've wired the whole house, but went with what the assistant would actually need to reach: room, lamp, hallway, bathroom.
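A minimal sketch of those calls, assuming the stock /rest endpoints and HTTP basic auth; host, credentials, and the Insteon node addresses are placeholders (real addresses come back as XML from GET /rest/nodes).

```python
import requests

ISY = "http://192.168.1.50"              # placeholder host
AUTH = ("admin", "password")             # placeholder credentials

NODES = {                                # the four nodes the assistant reaches
    "room": "12 34 56 1",                # placeholder Insteon addresses
    "lamp": "12 34 57 1",
    "hallway": "12 34 58 1",
    "bathroom": "12 34 59 1",
}

def set_light(name: str, on: bool) -> None:
    """DON turns a node on, DOF turns it off."""
    cmd = "DON" if on else "DOF"
    resp = requests.get(f"{ISY}/rest/nodes/{NODES[name]}/cmd/{cmd}", auth=AUTH)
    resp.raise_for_status()

set_light("lamp", False)
```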
Spotify
PyChromecast first, since the old Google Homes already spoke that language — but it couldn't pick songs by name. Switched to Spotipy 2.25.1, OAuth'd into my account, pointed playback at the speaker group I'd labeled Josh's Group years ago.
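A sketch of the Spotipy path, assuming the app credentials live in the SPOTIPY_* environment variables; the two scopes are the ones Spotify requires for reading devices and controlling playback.

```python
import spotipy
from spotipy.oauth2 import SpotifyOAuth

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    scope="user-read-playback-state user-modify-playback-state"))

# Find the speaker group by the name it was labeled years ago.
devices = sp.devices()["devices"]
group = next(d for d in devices if d["name"] == "Josh's Group")

# Pick a song by name: the thing PyChromecast couldn't do.
track = sp.search(q="Stereo Love", type="track", limit=1)["tracks"]["items"][0]
sp.start_playback(device_id=group["id"], uris=[track["uri"]])
```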
Dual route
Latency was the gap between assistant and presence. Split voice activation into two parallel tracks — Llama 3.2:3b on the spoken-response track ("yeah, on it"), Gemini 2.0 Flash Thinking on the function-call track. The room could answer out loud while Gemini was still deciding what to actually do.
[image: Efficient Assistant diagram — head icon speaking soundwaves into a microphone labeled 'Voice Activation', arrows splitting from a gear into two parallel boxes: 'Spoken Response | Fast' and 'Function Calls | Accurate']
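A sketch of the two-track dispatch, assuming Llama 3.2:3b is served locally by Ollama and Gemini is reached through the google-generativeai client. The model names and prompts are assumptions; the point is that both tracks launch at once instead of in sequence.

```python
import asyncio

import google.generativeai as genai
from ollama import AsyncClient

genai.configure(api_key="...")                     # placeholder
gemini = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")
llama = AsyncClient()

async def quick_ack(text: str) -> str:
    """Fast track: a short spoken acknowledgment from the small model."""
    r = await llama.chat(model="llama3.2:3b", messages=[
        {"role": "user", "content": f"Acknowledge in five words or fewer: {text}"}])
    return r["message"]["content"]

async def decide(text: str) -> str:
    """Accurate track: work out what to actually do."""
    r = await gemini.generate_content_async(
        f"Decide which device action this asks for: {text}")
    return r.text

async def handle(text: str) -> list[str]:
    # Both tracks start at once; the ack usually lands first.
    return await asyncio.gather(quick_ack(text), decide(text))

ack, action = asyncio.run(handle("turn off my hallway lights"))
```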
STT — Faster-Whisper
Whisper is the state of the art. Faster-Whisper is the open-source reimplementation that runs the same models several times faster. About 150 ms per response, quick enough that turn-taking starts to feel like a conversation instead of a query.
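The whole STT stage is a few lines. A sketch assuming a small English model on GPU; the model size and compute type are latency guesses, not the project's exact settings.

```python
from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cuda", compute_type="float16")
segments, _info = model.transcribe("utterance.wav", vad_filter=True)
print(" ".join(seg.text.strip() for seg in segments))
```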
LLM — small voice, big brain
Llama 3.2:3b on the response track is small and not particularly smart, but for "yeah, got it" and "alright, doing that now," small and fast is exactly what the route needs. Gemini 2.0 Flash Thinking carries the function calls. The spoken responses can be a little off; the function calls have to be exact.
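On the Gemini side, google-generativeai can turn plain Python functions into tool declarations; the names and signatures below are illustrative rather than the project's schema, and automatic function calling is one way to close the loop.

```python
import google.generativeai as genai

def set_light(node: str, on: bool) -> str:
    """Turn a light node on or off (room, lamp, hallway, bathroom)."""
    return f"{node} -> {'on' if on else 'off'}"

def play_song(title: str) -> str:
    """Play a song by name on the speaker group."""
    return f"queued {title}"

genai.configure(api_key="...")                     # placeholder
model = genai.GenerativeModel("gemini-2.0-flash", tools=[set_light, play_song])
chat = model.start_chat(enable_automatic_function_calling=True)
chat.send_message("turn off my hallway lights")
```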
TTS — Orpheus over Nari Dia
Tried Nari Dia 1.6B first — expressive, but too slow for a voice that's supposed to interrupt itself. Orpheus TTS gave back a balance, plus inline tags for laugh, chuckle, and breath that nudge it past flat-affect synthesis.
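Orpheus deployments vary, so here's only a sketch of how the tags ride inline, assuming a local OpenAI-compatible speech endpoint (one common way to serve it); the URL, payload shape, and voice name are assumptions, not the project's setup.

```python
# Hypothetical local endpoint; treat the URL and payload shape as
# assumptions. The expressive tag rides inside the text itself.
import requests

text = "Alright <chuckle> doing that now."
resp = requests.post("http://localhost:5005/v1/audio/speech",
                     json={"input": text, "voice": "tara"})
open("reply.wav", "wb").write(resp.content)
```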
iPad as mirror
Inspired by Siri's redesign — the gradient border around the screen edge that pulses with whatever's being spoken. Mounted the iPad on the wall as a mirror with the live camera feed underneath, gradient on top, and a small dark debug terminal overlaid in the corner so I could watch what it was hearing.
Stereo Love
Demo from the iPad mounted on the wall, gradient border pulsing with the audio. "Can you turn off my hallway lights?" — hallway clicks off through the doorway. "And turn off my room lights too." — overhead goes dark. "Can you play Stereo Love on Spotify?" — Spotipy pushes through Josh's Group, and Edward Maya fills the room.
Lamp off, live
End of the talk, no slide. Raised a hand: "Okay, turn off my lamp." The lamp behind me clicked off. Mic drop without the mic.
Truitt's note
Showed gapi v1 to a friend. Truitt: "really built jarvis just for a school project." Took that as the sign v1 wasn't enough — v2 should be proactive, know where I am, see my emails, control music.
Gemi — the v2 idea
Next page of the spiral notebook had a robot labeled Gemini standing between two speakers labeled System Input and User Input. The router fed it from the internet — void, chromecast, macbook, iphone — all running into the same listener.
Tech stack
Gemini at the center, surrounded by what it had to talk to — Universal Devices for the lights, Google APIs for everything Google, VLC and Google Cast for playback, Scikit-Learn for the presence model, Python on the server side, Swift on the iPad.
[image: Tech stack diagram — the word 'Gemini' with a blue star icon at the center, surrounded by logos in a circle: Universal Devices, Google APIs, VLC, Scikit-Learn, Google Cast, Python, Swift]
Gemini Live
Swapped the v1 STT-LLM-TTS chain for a single WebSocket to the Gemini Live API. Text, audio, and video out; text and audio back. One connection, real-time, no stitching three models together to fake the cadence.
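A minimal sketch of that single connection, using the google-genai SDK's Live client; the model name, config, and text-only turn are assumptions, since the real session streams mic audio and camera frames over the same socket the replies come back on.

```python
import asyncio
from google import genai

client = genai.Client(api_key="...")               # placeholder

async def main() -> None:
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(
            model="gemini-2.0-flash-exp", config=config) as session:
        await session.send(input="Turn off my lamp.", end_of_turn=True)
        async for msg in session.receive():
            if msg.data:                           # audio bytes coming back
                pass                               # hand off to the speaker

asyncio.run(main())
```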
iBeacon dead end
First attempt at presence detection: emit iBeacon advertisements from a Bluetooth card and triangulate from fixed receivers. The card refused to output the format. Spent days on it, never got a frame out.
Random Forest, 99.7%
Switched approaches. An iPhone app scanned every nearby MAC + dBm reading, labeled each scan by the room I was standing in (Bedroom, Media, Playroom), and POSTed to a tiny server in a constant stream. Four days of data later — RandomForestClassifier 99.7%, KNeighborsClassifier 98.6%, LinearSVC 98.0%. Live prediction came back Playroom 96 / Media 4 / Bedroom 0.
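A sketch of that training setup, assuming the scans land as (scan_id, mac, dbm, room) rows: pivot to one column per MAC, fill a floor value where a radio wasn't heard (the -100 is an assumption), and fit.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

scans = pd.read_csv("scans.csv")        # columns: scan_id, mac, dbm, room
X = scans.pivot_table(index="scan_id", columns="mac", values="dbm").fillna(-100)
y = scans.groupby("scan_id")["room"].first()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))        # the project's RandomForest: 99.7%
# Rows true, columns predicted; off-diagonal mass sat in the Playroom/Media pair.
print(confusion_matrix(y_test, clf.predict(X_test), labels=list(clf.classes_)))
```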
[image: Live prediction interface — dark mobile UI with a 'Live Prediction (0.25s)' panel showing certainty bars: Playroom 96%, Media 4%, Bedroom 0%, with 'Playroom' shown large in blue below as the 5-second winner with a green checkmark]
The room knows where I am.
Where it confused itself
Confusion matrix only ever mixed Playroom with Media — the two rooms that share a wall. The Bluetooth signal couldn't tell which side of the drywall I was on, and honestly neither could I half the time.
Gemi — Gemini Smart Home
Closing slide: a Wii console box, repackaged. Same proportions, same white plastic, same translucent play button — relabeled 'Gemi' on the console and 'Gemini Smart Home' on the box. The product the room had quietly become.
From the gallery
[image: Spiral notebook sketch — system architecture diagram in pencil, 'The Internet' connected to a router, router branching out via lines labeled void / chromecast / macbook / iphone into a central robot figure labeled 'Gemini' with a star on its chest, robot standing between two large rectangular speakers labeled 'System Input' and 'User Input']
[image: Confusion matrix from LinearSVC — 3×3 grid with True label (Bedroom, Media, Playroom) on the Y-axis and Predicted label on the X-axis, blue gradient cells, strong diagonal (107 Bedroom, 129 Media correct), with a noticeably darker off-diagonal cell at Playroom-predicted-as-Media (95 misclassifications)]
[image: Spotipy speaker list — terminal/Python output enumerating the user's available Spotify Connect devices, with 'Josh's Group' highlighted as the active speaker group target for playback]
What I came back with
99.7% room presence
Lesson from the terrain
Latency was the gap between an assistant and a presence in the room. Splitting the response into a fast Llama track and an accurate Gemini track meant the room could answer while it was still figuring out what to do — closer to how a person responds than how an API does. The iPad-as-mirror was the other half: not a screen you summon, but a surface that's always there, pulsing back at you when it hears you.