DiffSinger Inside a VST: Neural Vocal Synthesis for Producers

What Is DiffSinger?

DiffSinger is a neural network for singing voice synthesis (SVS). Given a melody, lyrics, and expression parameters, it generates realistic vocal audio. Rather than splicing recorded samples together, it uses a diffusion model trained on singing voice data to predict a mel-spectrogram, which a neural vocoder then converts into a waveform.

The result sounds natural in a way that concatenative synthesis (UTAU, classic Vocaloid) struggles to match. Transitions between phonemes are smooth. Vibrato emerges organically from the model. Breathiness, tension, and vocal color are continuous parameters, not on/off switches.

Until now, using DiffSinger meant running a standalone editor or command-line tool, rendering WAV files, and importing them into your DAW manually. Arbit brings DiffSinger directly into a VST3/CLAP plugin.

The Workflow

In Arbit's piano roll, vocal synthesis is part of the composition process — not a separate rendering step:

Place notes on the piano roll as usual — pitch, timing, duration.
Type lyrics directly onto each note. Arbit's phonemizer converts text to phonemes automatically, with support for English, Japanese, Chinese, and Korean.
Draw expression curves in six dedicated lanes below the piano roll.
Render — DiffSinger generates the vocal audio in the background. The waveform appears on the track for visual feedback.

Change a note, adjust expression, edit lyrics — the render updates automatically. No export/import cycle, no file management.

Six Expression Lanes

DiffSinger in Arbit exposes six continuous parameters per note, each drawn as an automation lane:

Gender (−1 to +1): Shifts formants to control perceived vocal character. Negative values produce a darker, larger vocal tract sound. Positive values brighten and thin the voice.
Breathiness (0 to 1): Controls the noise component in the vocal signal. At 0, the voice is clean and focused. At 1, it's a breathy whisper.
Tension (0 to 1): Models vocal fold tension. Higher values produce a tighter, more strained sound — useful for belting or emotional peaks. Lower values relax into a softer delivery.
Energy (0 to 1): Modulates phoneme loudness. Allows dynamic variation within a phrase without changing velocity.
Expressiveness (0 to 1): Controls amplitude vibrato intensity. At low values, the vocal is steady. At high values, the natural vibrato of the trained voice emerges fully.
Velocity (0.5 to 2.0): Adjusts articulation speed. Lower values produce legato, drawn-out consonants. Higher values create crisp, percussive attacks.

Each lane can be set per-track (as a default for all notes) or overridden per-note for detailed control. This means you can set a track-wide vocal character and then sculpt individual phrases with per-note adjustments.
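
The track-default and per-note-override behavior can be sketched in a few lines of Python. This is a minimal illustration, not Arbit's actual API: the lane ranges come from the list above, while the names (LANE_RANGES, Note, resolve) and the default values are assumptions made up for the example.

```python
from dataclasses import dataclass, field

# Ranges taken from the lane descriptions above. The dictionary layout and
# every name in this sketch are hypothetical, not Arbit's real data model.
LANE_RANGES = {
    "gender": (-1.0, 1.0),
    "breathiness": (0.0, 1.0),
    "tension": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "expressiveness": (0.0, 1.0),
    "velocity": (0.5, 2.0),
}


def clamp(lane: str, value: float) -> float:
    """Keep a lane value inside its documented range."""
    lo, hi = LANE_RANGES[lane]
    return max(lo, min(hi, value))


@dataclass
class Note:
    pitch: int                      # MIDI note number
    lyric: str
    overrides: dict = field(default_factory=dict)  # lane name -> value


def resolve(note: Note, track_defaults: dict) -> dict:
    """A per-note override wins; otherwise the track default applies."""
    return {
        lane: clamp(lane, note.overrides.get(lane, track_defaults[lane]))
        for lane in LANE_RANGES
    }


# Illustrative track-wide character (values are made up for the example).
track_defaults = {
    "gender": 0.0, "breathiness": 0.2, "tension": 0.3,
    "energy": 0.5, "expressiveness": 0.5, "velocity": 1.0,
}

# One belted note overrides tension and (out of range) energy.
belted = Note(pitch=72, lyric="high", overrides={"tension": 0.9, "energy": 1.4})
params = resolve(belted, track_defaults)
```

Here the belted note keeps the track's breathiness, takes its own tension, and has its out-of-range energy clamped to the top of the lane.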

Speaker Blending

DiffSinger models can be trained on multiple speakers. Arbit supports speaker blending — interpolating between voices by weight. Set a note to 70% one voice, 30% another, and the output blends the timbral characteristics of both.

This enables continuous voice color changes across a performance. Morph from one singer to another over the course of a phrase, or create hybrid voices that don't exist in any training data.
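
One common way to implement this kind of blending is a weighted average of per-speaker embedding vectors. The sketch below assumes that approach; the function names, speaker names, and the toy 3-dimensional embeddings are all hypothetical, and Arbit's internals may differ.

```python
from typing import Dict, List


def blend_speakers(embeddings: Dict[str, List[float]],
                   weights: Dict[str, float]) -> List[float]:
    """Weighted average of speaker embedding vectors (weights are normalized)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one speaker weight must be positive")
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        for i, component in enumerate(embeddings[name]):
            mixed[i] += (weight / total) * component
    return mixed


def morph_weights(t: float) -> Dict[str, float]:
    """Crossfade from singer_a (t = 0) to singer_b (t = 1)."""
    t = max(0.0, min(1.0, t))
    return {"singer_a": 1.0 - t, "singer_b": t}


# Toy embeddings; real ones are learned by the model and much larger.
embeddings = {
    "singer_a": [1.0, 0.0, 0.5],
    "singer_b": [0.0, 1.0, 0.5],
}

# The 70% / 30% blend from the text.
blended = blend_speakers(embeddings, {"singer_a": 0.7, "singer_b": 0.3})

# Halfway through a phrase-long morph from one singer to the other.
midpoint = blend_speakers(embeddings, morph_weights(0.5))
```

Evaluating morph_weights at each note's position in the phrase would give the gradual singer-to-singer transition described above.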

Phoneme Editing

Arbit's phonemizer converts input lyrics to ARPAbet phonemes automatically, but you keep full control over the result:

Edit phoneme sequences when the auto-conversion gets it wrong
Adjust phoneme durations — drag boundaries to lengthen or shorten individual sounds within a note
Phoneme overlay — see phoneme boundaries directly on the piano roll

This level of control matters for getting natural timing on consonant clusters, diphthongs, and cross-language lyrics.
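
Dragging a phoneme boundary can be thought of as moving time from one phoneme to its neighbor while the note's total length stays fixed. The sketch below illustrates that idea for the word "street" (ARPAbet: S T R IY T). The function name, the durations, and the 20 ms minimum are assumptions for the example, not Arbit's actual behavior.

```python
from typing import List

MIN_DUR = 0.02  # seconds; hypothetical floor so no phoneme collapses to zero


def move_boundary(durations: List[float], index: int, delta: float) -> List[float]:
    """Shift the boundary between phoneme `index` and `index + 1`.

    A positive delta lengthens the left phoneme and shortens the right one.
    The drag is clamped so neither phoneme drops below MIN_DUR, and the
    total duration of the note is unchanged.
    """
    left, right = durations[index], durations[index + 1]
    delta = max(MIN_DUR - left, min(delta, right - MIN_DUR))
    out = list(durations)
    out[index] = left + delta
    out[index + 1] = right - delta
    return out


phonemes = ["S", "T", "R", "IY", "T"]
durations = [0.08, 0.04, 0.05, 0.28, 0.05]  # seconds, illustrative values

# Lengthen the vowel slightly by borrowing 10 ms from the final T.
longer_vowel = move_boundary(durations, 3, 0.01)

# An over-enthusiastic drag on the S/T boundary stops at the minimum duration.
clamped = move_boundary(durations, 0, 1.0)
```

Keeping the total fixed means an edit inside one note never shifts the timing of the notes around it.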

Why DiffSinger in a VST Matters

Standalone vocal synthesis tools create a disconnect in the production workflow. You write music in your DAW, switch to a separate editor for vocals, render, import, realign. Every edit requires repeating the cycle.

With DiffSinger inside Arbit, vocals live in the same environment as every other part of your composition. They respond to the same transport, the same tempo changes, the same harmonic links. A vocal note linked at a 5:4 ratio to an instrumental bass note will render at that pure major third — dynamic harmony applied to neural vocals.
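
The arithmetic behind that 5:4 link is easy to check. The sketch below computes the linked vocal frequency and how far a pure major third sits from the 400-cent third of 12-TET. The helper names and the 110 Hz bass note are illustrative choices, not part of Arbit.

```python
import math


def linked_frequency(root_hz: float, num: int, den: int) -> float:
    """Frequency of a note linked to root_hz at the ratio num:den."""
    return root_hz * num / den


def ratio_to_cents(num: int, den: int) -> float:
    """Size of the interval num:den in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(num / den)


bass_hz = 110.0                               # A2, chosen for illustration
vocal_hz = linked_frequency(bass_hz, 5, 4)    # 137.5 Hz

just_third_cents = ratio_to_cents(5, 4)       # about 386.31 cents
tet_third_cents = 400.0                       # four 12-TET semitones
offset_cents = just_third_cents - tet_third_cents  # about -13.69 cents

print(f"vocal: {vocal_hz} Hz, offset from 12-TET: {offset_cents:.2f} cents")
```

The pure third lands roughly 14 cents below its equal-tempered counterpart, which is the difference a 12-TET vocal render cannot reproduce.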

No other vocal synthesis plugin combines neural SVS with per-note just intonation tuning. Synthesizer V and Vocaloid render to 12-TET. UTAU editors don't run inside DAWs. Arbit is the first tool where the tuning system and the vocal engine are aware of each other.
