Audio samples from "VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design"

Abstract: Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.



Single Speaker (LJ Speech Dataset)

The only interval was the time necessary to ride in the elevator from the second to the sixth floor and walk back to the southeast corner.
Here and there a book is printed in France or Germany with some pretension to good taste,
Its bricks, measuring about thirteen inches square and three inches in thickness, were burned and stamped with the usual short inscription:
Two weeks pass, and at last you stand on the eastern edge of the plateau
the Secret Service had received from the FBI some nine thousand reports on members of the Communist Party.

Ground Truth
w/ Deterministic Duration Predictor
w/o Alignment Noise
w/o Transformer Block in Normalizing Flows

Experiments using Normalized Texts (LJ Speech Dataset)

Ground Truth  
VITS using Normalized Texts
VITS2 using Normalized Texts

Multiple Speakers (VCTK Dataset)

We have been going for three years.
Military action is the only option we have on the table today.
I don't think it would make any difference.
As agreed, the prime minister was driven to Westminster Hall.
This film will be totally awesome.

Ground Truth  

Comparison with JETS (Author Feedback, LJ Speech Dataset / texts from LibriTTS test set)

"Now, Bannister, will you please tell us the truth about yesterday's incident?"
"No; I am quite proud of my person," was the reply.
I had to read it over carefully, as the text must be absolutely correct.
If she could only see Phronsie for just one moment!
I could not see that in either case Holmes had come upon the clue for which he was searching.