Riff - Voice-First AI Music Production Partner

Problem

There’s a class of music lover with strong taste but no production craft. They can hear what they want, and they can hum the chorus, but the path from “this is the feel” to a finished song is full of friction. Existing AI music tools mostly take a text prompt and spit out a track. They don’t iterate with you, they don’t listen to your hum and turn it into the lead trumpet of the chorus, and they make you context-switch into a DAW you don’t know how to use.

I’m building something that meets a non-producer where they are: speaking, humming, singing. No chat panel. The partner listens, narrates while it works, and you direct it the way you’d direct a session musician.

Where it stands

This is in active development. The brief schema, voice I/O pipeline, partner agent, and app shell are specced and partly implemented. What’s documented below is the system I’m building toward, not a shipped product. The source is private while it’s still moving fast. The portfolio chat at the top of this site, the design discipline, and the technical decisions are all real; the end-user-facing music experience does not exist yet.

Approach

A JARVIS-style ambient partner: always listening, always interruptible, with personality. The screen is a heads-up reflection of what’s currently happening, not a transcript log. The design target: the partner translates user intent (verbal or melodic) into structured musical decisions, renders ~30s of audio in under a minute, and iterates section by section.

Brief schema. The data structure between user intent and the audio engine. Typed song state (sections, lyric lines, instrument stems with stable IDs), an append-only log of typed operations, versioned history. Captured audio clips are first-class entities with lineage. A user’s hum can become the trumpet of the chorus, and the partner-generated trumpet keeps a parent_clip_id pointing back at the original hum.
Voice I/O pipeline. Always-on listening with voice activity detection, Deepgram for ASR, Cartesia for TTS, streaming so the partner can speak its acknowledgement within ~500ms while a 15–30s render is still in flight.
Partner agent. Claude as the LLM, emitting typed operations against the brief schema via tool calls. Speech and state-mutation are independent channels, so the partner can narrate “letting you off the hook; sharpening it now” before the render starts.
Focus concept. No chat means “play it again,” “try another,” and “set it as the lead” all depend on what’s currently active. A Focus value lives at the song level and tracks active section, line, clip, and mode (editing lyrics / auditioning variants / playing / rendering). UI and intent resolution both read from it.
Local-first. IndexedDB for song state and captured audio clips. ONNX Runtime in the browser for client-side ML.
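To make the brief schema concrete, here is a minimal TypeScript sketch of typed song state, a typed operation union, and an append-only log. It is an illustration under stated assumptions, not the project's actual schema: only parent_clip_id is a name from the text above, and every other type, field, and function name (Brief, Op, appendOp, and so on) is hypothetical.

```typescript
// Hypothetical shapes for the brief schema. Only parent_clip_id is a name
// taken from the text; everything else is an illustrative assumption.
type Id = string;

interface Clip {
  id: Id;
  origin: "captured" | "generated";
  parent_clip_id: Id | null; // lineage pointer; null for an original capture
}
interface Stem { id: Id; instrument: string; clip_id: Id | null }
interface LyricLine { id: Id; text: string }
interface Section { id: Id; name: string; lines: LyricLine[]; stems: Stem[] }

// Every edit is a typed operation, discriminated by `kind`.
type Op =
  | { kind: "add_section"; section_id: Id; name: string }
  | { kind: "add_stem"; section_id: Id; stem_id: Id; instrument: string }
  | { kind: "assign_clip"; section_id: Id; stem_id: Id; clip_id: Id };

interface LogEntry { seq: number; op: Op }

interface Brief {
  sections: Section[];
  clips: Record<Id, Clip>;
  log: readonly LogEntry[];
}

// Append-only: builds a new log and never mutates or reorders earlier entries.
function appendOp(log: readonly LogEntry[], op: Op): readonly LogEntry[] {
  return [...log, { seq: log.length + 1, op }];
}

// A captured hum and a generated trumpet that points back at it.
const hum: Clip = { id: "hum-1", origin: "captured", parent_clip_id: null };
const trumpet: Clip = { id: "trumpet-1", origin: "generated", parent_clip_id: hum.id };

const log1 = appendOp([], { kind: "add_section", section_id: "chorus", name: "Chorus" });
const log2 = appendOp(log1, {
  kind: "add_stem", section_id: "chorus", stem_id: "lead", instrument: "trumpet",
});
const log3 = appendOp(log2, {
  kind: "assign_clip", section_id: "chorus", stem_id: "lead", clip_id: trumpet.id,
});

// State is written out by hand here; deriving it from the log is a separate step.
const brief: Brief = {
  sections: [{
    id: "chorus", name: "Chorus", lines: [],
    stems: [{ id: "lead", instrument: "trumpet", clip_id: trumpet.id }],
  }],
  clips: { [hum.id]: hum, [trumpet.id]: trumpet },
  log: log3,
};
```

Stable string IDs on sections, stems, and clips are what let a later operation name its target without depending on array positions.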

Voice-first means a captured hum is a first-class entity, not metadata. The partner can transform it, audition variants against it, and promote one into a section’s stem. Each derived clip points back at the hum’s clip ID, so the original stays addressable anywhere in the lineage graph.
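One way to read the lineage idea is as a chain of parent pointers that can be walked from any variant back to the original capture. The sketch below assumes a flat clip table keyed by ID; the Clip shape, the lineage function, and the sample IDs are all hypothetical, with parent_clip_id being the only field named in the text.

```typescript
// Hypothetical clip record; parent_clip_id is the only field named in the text.
interface Clip {
  id: string;
  label: string;
  parent_clip_id: string | null;
}

// Walk parent pointers from a clip back to its root and return the chain of IDs,
// starting with the clip itself and ending with the original capture.
function lineage(clips: Record<string, Clip>, startId: string): string[] {
  const chain: string[] = [];
  let current: string | null = startId;
  while (current !== null) {
    // Guard against a corrupted graph that loops back on itself.
    if (chain.indexOf(current) !== -1) throw new Error(`cycle in lineage at ${current}`);
    const clip: Clip | undefined = clips[current];
    if (clip === undefined) throw new Error(`unknown clip ${current}`);
    chain.push(clip.id);
    current = clip.parent_clip_id;
  }
  return chain;
}

const clips: Record<string, Clip> = {
  "hum-1": { id: "hum-1", label: "captured hum", parent_clip_id: null },
  "trumpet-1": { id: "trumpet-1", label: "trumpet variant A", parent_clip_id: "hum-1" },
  "trumpet-2": { id: "trumpet-2", label: "trumpet variant B", parent_clip_id: "hum-1" },
};

const chain = lineage(clips, "trumpet-1"); // ["trumpet-1", "hum-1"]
```

Both trumpet variants resolve to the same root, which is what makes "audition variants against the hum" a simple lookup rather than a search.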

Architecture

Mic (always-on, VAD)
→ Deepgram (ASR)
→ Claude (partner agent)
→ Brief Schema (typed ops)
→ Compiler (engine payload)
→ Audio Engine (~30s render)
→ Cartesia (TTS narration)

The partner agent emits typed operations against the schema; a state machine validates them and produces a new SongVersion. A render compiler translates state plus a render scope into engine-specific payloads, so the schema is engine-agnostic.
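The validate-then-compile flow described above could look roughly like this. It is a sketch under assumptions: SongVersion is the only name from the text, while the operation kinds, the result type, RenderScope, and the "demo-engine" payload format are invented for illustration.

```typescript
// Hypothetical types; the text only names SongVersion.
interface Stem { id: string; instrument: string }
interface Section { id: string; name: string; stems: Stem[] }
interface SongVersion { version: number; sections: Section[] }

type Op =
  | { kind: "add_section"; section_id: string; name: string }
  | { kind: "add_stem"; section_id: string; stem_id: string; instrument: string };

type ApplyResult =
  | { status: "ok"; next: SongVersion }
  | { status: "rejected"; error: string };

// Validate an op against the current version. On success, return a new version
// and leave the previous one untouched so history stays intact.
function applyOp(cur: SongVersion, op: Op): ApplyResult {
  const exists = cur.sections.some((s) => s.id === op.section_id);
  if (op.kind === "add_section") {
    if (exists) return { status: "rejected", error: `section ${op.section_id} already exists` };
    const added: Section = { id: op.section_id, name: op.name, stems: [] };
    return { status: "ok", next: { version: cur.version + 1, sections: [...cur.sections, added] } };
  }
  // Remaining case: add_stem.
  if (!exists) return { status: "rejected", error: `unknown section ${op.section_id}` };
  const targetId = op.section_id;
  const newStem: Stem = { id: op.stem_id, instrument: op.instrument };
  const sections = cur.sections.map((s) =>
    s.id === targetId ? { id: s.id, name: s.name, stems: [...s.stems, newStem] } : s
  );
  return { status: "ok", next: { version: cur.version + 1, sections } };
}

// Render scope plus state in, engine-specific payload out. A second engine
// would get its own compile function over the same SongVersion.
interface RenderScope { section_id: string }
interface DemoEnginePayload {
  engine: "demo-engine";
  song_version: number;
  tracks: { stem_id: string; instrument: string }[];
}

function compileForDemoEngine(v: SongVersion, scope: RenderScope): DemoEnginePayload {
  const section = v.sections.filter((s) => s.id === scope.section_id)[0];
  if (section === undefined) throw new Error(`unknown section ${scope.section_id}`);
  return {
    engine: "demo-engine",
    song_version: v.version,
    tracks: section.stems.map((st) => ({ stem_id: st.id, instrument: st.instrument })),
  };
}

const v0: SongVersion = { version: 0, sections: [] };
const r1 = applyOp(v0, { kind: "add_section", section_id: "chorus", name: "Chorus" });
if (r1.status !== "ok") throw new Error(r1.error);
const r2 = applyOp(r1.next, {
  kind: "add_stem", section_id: "chorus", stem_id: "lead", instrument: "trumpet",
});
if (r2.status !== "ok") throw new Error(r2.error);

const payload = compileForDemoEngine(r2.next, { section_id: "chorus" });
// An op that targets a section that does not exist is rejected, not applied.
const rejected = applyOp(r2.next, {
  kind: "add_stem", section_id: "bridge", stem_id: "pad", instrument: "synth",
});
```

Returning a rejection value instead of throwing gives the partner something it can narrate back to the user, and keeping the compiler separate is what lets the schema stay unaware of any particular engine.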

Design bets

Voice-first. No chat panel; only ambient speech, audio, and a heads-up reflection of state.

Typed ops. Every edit is a semantic operation with audit lineage, not an opaque state diff.

Clip lineage. User hums and AI-generated variants share an addressable graph.

Reflections so far

Removing the chat panel is harder than adding one. Every “play it again,” “try another,” “set it as the lead” depends on an explicit Focus concept that the partner agent and UI both read from.
The brief schema is where the project lives or dies. Get the typed operations right and the LLM has a stable surface to act on; get them vague and you’re back to opaque state diffs.
Modern LLMs handle typed JSON output via tool calls reliably enough to make structured operations the default, not an optimisation to add later.
Cartesia + Deepgram + Claude streaming makes ~500ms verbal acknowledgement feasible on paper. Whether it holds up in front of a real user with a mic is the next thing to find out.
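The Focus dependency in the first reflection can be sketched as a small resolver that turns a target-less utterance into a concrete action. This assumes the Focus fields listed earlier (active section, line, clip, and mode); the field names, the Intent and Action types, and the resolve function are hypothetical. Here "try another" derives a new variant from the focused clip, although deriving from that clip's parent is an equally plausible reading.

```typescript
// Hypothetical Focus shape, following the fields listed in the text.
type Mode = "editing_lyrics" | "auditioning_variants" | "playing" | "rendering";

interface Focus {
  section_id: string | null;
  line_id: string | null;
  clip_id: string | null;
  mode: Mode;
}

// Spoken intents that carry no explicit target.
type Intent = "play_again" | "try_another" | "set_as_lead";

type Action =
  | { kind: "play_clip"; clip_id: string }
  | { kind: "generate_variant"; parent_clip_id: string; section_id: string }
  | { kind: "assign_lead"; section_id: string; clip_id: string }
  | { kind: "ask_user"; reason: string };

// Fill in the missing target from Focus, or fall back to asking the user
// when Focus does not hold enough to act on.
function resolve(intent: Intent, focus: Focus): Action {
  const { section_id, clip_id } = focus;
  if (clip_id === null) return { kind: "ask_user", reason: "no clip is in focus" };
  if (intent === "play_again") return { kind: "play_clip", clip_id };
  if (section_id === null) return { kind: "ask_user", reason: "no section is in focus" };
  if (intent === "try_another") {
    return { kind: "generate_variant", parent_clip_id: clip_id, section_id };
  }
  return { kind: "assign_lead", section_id, clip_id };
}

const focus: Focus = {
  section_id: "chorus", line_id: null, clip_id: "trumpet-1", mode: "auditioning_variants",
};
const emptyFocus: Focus = { section_id: null, line_id: null, clip_id: null, mode: "playing" };

const replay = resolve("play_again", focus);
const promote = resolve("set_as_lead", focus);
const unclear = resolve("try_another", emptyFocus);
```

Because the UI would read the same Focus value, what the screen highlights and what "it" refers to in speech cannot drift apart.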