
DiffSinger Inside a VST: Neural Vocal Synthesis for Producers

March 5, 2026 · 8 min read · DonutsDelivery

What Is DiffSinger?

DiffSinger is a neural network for singing voice synthesis (SVS). Given a melody, lyrics, and expression parameters, it generates realistic vocal audio, not by splicing recorded samples together, but by using a diffusion model trained on singing voice data to predict a mel-spectrogram, which a neural vocoder then converts into a waveform.

The result sounds natural in a way that concatenative synthesis (UTAU, Vocaloid) can't match. Transitions between phonemes are smooth. Vibrato emerges organically from the model. Breathiness, tension, and vocal color are continuous parameters, not on/off switches.

Until now, using DiffSinger meant running a standalone editor or command-line tool, rendering WAV files, and importing them into your DAW manually. Arbit brings DiffSinger directly into a VST3/CLAP plugin.

The Workflow

In Arbit's piano roll, vocal synthesis is part of the composition process, not a separate rendering step. Change a note, adjust expression, or edit lyrics, and the render updates automatically. No export/import cycle, no file management.

Six Expression Lanes

DiffSinger in Arbit exposes six continuous parameters per note, each drawn as an automation lane.

Each lane can be set per-track (as a default for all notes) or overridden per-note for detailed control. This means you can set a track-wide vocal character and then sculpt individual phrases with per-note adjustments.
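The track-default-plus-override behavior can be pictured as a simple dictionary merge. This is only a sketch under assumptions: the `Note` class, the `resolve_expression` helper, and the lane values are hypothetical, and only two lane names (breathiness and tension, both mentioned earlier in this post) are used. It is not Arbit's actual data model.

```python
from dataclasses import dataclass, field

# Hypothetical track-wide defaults; lane names and values are illustrative only.
TRACK_DEFAULTS = {"breathiness": 0.2, "tension": 0.5}


@dataclass
class Note:
    pitch: int                                      # MIDI note number
    lyric: str
    overrides: dict = field(default_factory=dict)   # per-note lane values


def resolve_expression(track_defaults, note):
    """Return the effective lane values for one note.

    Track defaults apply first; any per-note override replaces
    the default for that lane only.
    """
    return {**track_defaults, **note.overrides}


notes = [
    Note(60, "la"),                          # uses the track defaults
    Note(62, "la", {"breathiness": 0.8}),    # overrides a single lane
]
resolved = [resolve_expression(TRACK_DEFAULTS, n) for n in notes]
```

In this sketch the second note becomes breathier while keeping the track-wide tension, which mirrors the "set a character, then sculpt phrases" idea described above.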

Speaker Blending

DiffSinger models can be trained on multiple speakers. Arbit supports speaker blending — interpolating between voices by weight. Set a note to 70% one voice, 30% another, and the output blends the timbral characteristics of both.

This enables continuous voice color changes across a performance. Morph from one singer to another over the course of a phrase, or create hybrid voices that don't exist in any training data.
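One common way to interpolate between voices is a normalized weighted average of per-speaker embedding vectors. The sketch below assumes that approach; the `blend_speakers` function, the voice names, and the toy three-dimensional embeddings are all hypothetical, and Arbit's internal blending may work differently.

```python
def blend_speakers(embeddings, weights):
    """Weighted average of speaker embedding vectors.

    Weights are normalized so they sum to 1 before mixing.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        vec = embeddings[name]
        for i in range(dim):
            mixed[i] += (weight / total) * vec[i]
    return mixed


# Toy embeddings; real speaker embeddings are much higher-dimensional.
embeddings = {
    "voice_a": [1.0, 0.0, 2.0],
    "voice_b": [0.0, 1.0, 4.0],
}

# A single note at 70% voice_a, 30% voice_b.
mixed = blend_speakers(embeddings, {"voice_a": 0.7, "voice_b": 0.3})

# Morphing across a phrase: sweep the weight from voice_a to voice_b.
phrase = [
    blend_speakers(embeddings, {"voice_a": 1.0 - t, "voice_b": t})
    for t in (0.0, 0.5, 1.0)
]
```

Because the blend is a point between the two embeddings, intermediate weights describe a voice that neither speaker's data contains on its own.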

Phoneme Editing

Arbit's phonemizer converts input lyrics to ARPAbet phonemes automatically, but you have full control over the result.

This level of control matters for getting natural timing on consonant clusters, diphthongs, and cross-language lyrics.

Why DiffSinger in a VST Matters

Standalone vocal synthesis tools create a disconnect in the production workflow. You write music in your DAW, switch to a separate editor for vocals, render, import, realign. Every edit requires repeating the cycle.

With DiffSinger inside Arbit, vocals live in the same environment as every other part of your composition. They respond to the same transport, the same tempo changes, the same harmonic links. A vocal note linked at a 5:4 ratio to an instrumental bass note will render at that pure major third — dynamic harmony applied to neural vocals.
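The arithmetic behind a 5:4 link is easy to check. The sketch below uses hypothetical helper names (`linked_frequency`, `cents`) and an arbitrary 110 Hz bass note chosen for illustration. It computes the linked pitch and how far a pure major third sits from the 12-TET major third of 400 cents.

```python
import math


def linked_frequency(base_hz, numerator, denominator):
    """Frequency of a note tuned at numerator:denominator above base_hz."""
    return base_hz * numerator / denominator


def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(ratio)


bass_hz = 110.0                                # illustrative bass note (A2)
vocal_hz = linked_frequency(bass_hz, 5, 4)     # pure major third above the bass

just_third = cents(5 / 4)                      # about 386.31 cents
tet_third = 400.0                              # 12-TET major third
offset = just_third - tet_third                # about -13.69 cents
```

The pure third is roughly 14 cents flatter than its equal-tempered counterpart, which is the difference a 12-TET renderer cannot express without per-note retuning.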

No other vocal synthesis plugin combines neural SVS with per-note just intonation tuning. Synthesizer V and Vocaloid render to 12-TET. UTAU editors don't run inside DAWs. Arbit is the first tool where the tuning system and the vocal engine are aware of each other.
