Audio samples from "VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design"

Abstract: Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.

Contents

Single Speaker (LJ Speech Dataset)
Experiments using Normalized Texts (LJ Speech Dataset)
Multiple Speakers (VCTK Dataset)
Comparison with JETS (Author Feedback, LJ Speech Dataset / texts from LibriTTS test set)

Single Speaker (LJ Speech Dataset)

	The only interval was the time necessary to ride in the elevator from the second to the sixth floor and walk back to the southeast corner.	Here and there a book is printed in France or Germany with some pretension to good taste,	Its bricks, measuring about thirteen inches square and three inches in thickness, were burned and stamped with the usual short inscription:	Two weeks pass, and at last you stand on the eastern edge of the plateau	the Secret Service had received from the FBI some nine thousand reports on members of the Communist Party.
Ground Truth
VITS
VITS2
w/ Deterministic Duration Predictor
w/o Alignment Noise
w/o Transformer Block in Normalizing Flows

Experiments using Normalized Texts (LJ Speech Dataset)

Ground Truth
VITS using Normalized Texts
VITS2 using Normalized Texts
	a third miscreant made a similar but far less serious attempt in the month of July following.	the same callous indifference to the moral well-being of the prisoners, the same want of employment and of all disciplinary control.	during which time a host of witnesses were examined, and the committee presented three separate reports,	gaming of all sorts should be peremptorily forbidden under heavy pains and penalties.	A man named Lears, under sentence of transportation for an attempt at murder on board ship, got up part of the way,

Multiple Speakers (VCTK Dataset)

	We have been going for three years.	Military action is the only option we have on the table today.	I don't think it would make any difference.	As agreed, the prime minister was driven to Westminster Hall.	This film will be totally awesome.
Ground Truth
VITS
VITS2

Comparison with JETS (Author Feedback, LJ Speech Dataset / texts from LibriTTS test set)

	"Now, Bannister, will you please tell us the truth about yesterday's incident?"	"No; I am quite proud of my person," was the reply.	I had to read it over carefully, as the text must be absolutely correct.	If she could only see Phronsie for just one moment!	I could not see that in either case Holmes had come upon the clue for which he was searching.
JETS
VITS2