Question

We illustrate the model structure in Figure 1. The two main components of OpenVoice are the base speaker TTS model and the tone color converter. The base speaker TTS model is a single-speaker or multi-speaker model that allows control over the style parameters (e.g., emotion, accent, rhythm, pauses and intonation) and the language. The voice generated by this model is passed to the tone color converter, which changes the tone color of the base speaker into that of the reference speaker.
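To make the two-stage data flow concrete, here is a minimal Python sketch. The names base_tts and tone_color_converter are hypothetical placeholders for the two neural components described above, not the actual OpenVoice API; the stubs only illustrate what goes in and what comes out.

    import numpy as np

    # Hypothetical stand-ins for the two components; the real models are
    # neural networks, and these stubs only illustrate the data flow.
    def base_tts(text: str, style: str, language: str) -> np.ndarray:
        """Returns x(LI,SI,CI): speech in the base speaker's tone color CI."""
        return np.zeros(16000, dtype=np.float32)  # placeholder waveform

    def tone_color_converter(base_audio: np.ndarray,
                             reference_audio: np.ndarray) -> np.ndarray:
        """Returns x(LI,SI,CO): same language/style, reference tone color CO."""
        return base_audio  # placeholder: a real converter swaps the tone color

    # x(LO,SO,CO): a short recording of the reference speaker.
    reference = np.zeros(16000, dtype=np.float32)

    # Stage 1 renders the text in the requested style and language; stage 2
    # replaces only the tone color, leaving style and language untouched.
    output = tone_color_converter(
        base_tts("Hello world.", style="cheerful", language="English"),
        reference,
    )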
Base Speaker TTS Model. The choice of the base speaker TTS model is very flexible. For example, the VITS [6] model can be modified to accept style and language embeddings in its text encoder and duration predictor. Other choices such as InstructTTS [17] can also accept style prompts. It is also possible to use commercially available (and cheap) models such as Microsoft TTS, which accepts Speech Synthesis Markup Language (SSML) that specifies the emotion, pauses and articulation. One can even skip the base speaker TTS model and read the text oneself in whatever style and language one desires. In our OpenVoice implementation, we use the VITS [6] model by default, but other choices are completely feasible. We denote the output of the base model as x(LI,SI,CI), where the three parameters represent the language, style and tone color respectively. Similarly, the speech audio from the reference speaker is denoted as x(LO,SO,CO).
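As an illustration of the SSML route mentioned above, a base model driven through SSML might receive markup like the following. This snippet uses only standard SSML tags (prosody and break) and is a generic example, not vendor-specific; emotion control in Microsoft TTS is done with proprietary extensions layered on top of SSML.

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      <prosody rate="95%" pitch="+2st">Hello there.</prosody>
      <break time="500ms"/>
      Nice to meet you.
    </speak>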
Tone Color Converter. The tone color converter is an encoder-decoder structure with an invertible normalizing flow [12] in the middle. The encoder is a 1D convolutional neural network that takes the short-time Fourier transformed spectrum of x(LI,SI,CI) as input. All convolutions are single-strided. The feature maps output by the encoder are denoted as Y(LI,SI,CI). The tone color extractor is a simple 2D convolutional neural network that operates on the mel-spectrogram of the input voice and outputs a single feature vector encoding the tone color information. We apply it to x(LI,SI,CI) to obtain the vector v(CI), then apply it to x(LO,SO,CO) to obtain the vector v(CO).
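The following PyTorch sketch shows the general shape of these two feature extractors. The channel counts, kernel sizes and depths are assumptions chosen for illustration, not the configuration used in the paper.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """1D CNN over the STFT spectrum of x(LI,SI,CI). All convolutions
        are single-strided, so the time resolution is preserved."""
        def __init__(self, n_freq: int = 513, hidden: int = 192):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_freq, hidden, kernel_size=5, stride=1, padding=2),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=5, stride=1, padding=2),
            )

        def forward(self, spec):       # spec: (batch, n_freq, frames)
            return self.net(spec)      # Y(LI,SI,CI): (batch, hidden, frames)

    class ToneColorExtractor(nn.Module):
        """2D CNN over a mel-spectrogram, pooled to one tone color vector."""
        def __init__(self, dim: int = 256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(32, dim, kernel_size=3, stride=2, padding=1),
            )

        def forward(self, mel):        # mel: (batch, 1, n_mels, frames)
            h = self.conv(mel)
            return h.mean(dim=(2, 3))  # v(C): (batch, dim)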
The normalizing flow layers take Y(LI,SI,CI) and v(CI) as input and output a feature representation Z(LI,SI) that eliminates the tone color information but preserves all remaining style properties. The feature Z(LI,SI) is aligned with the International Phonetic Alphabet (IPA) [1] along the time dimension. Details of how such a feature representation is learned are explained in the next section. Then we apply the normalizing flow layers in the inverse direction, taking Z(LI,SI) and v(CO) as input and outputting Y(LI,SI,CO). This is the critical step in which the tone color CO of the reference speaker is embodied into the feature maps. Finally, Y(LI,SI,CO) is decoded into the raw waveform x(LI,SI,CO) by HiFi-GAN [7], which contains a stack of transposed 1D convolutions. The entire model in our OpenVoice implementation is feed-forward, without any auto-regressive component.
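The forward/inverse use of the flow can be illustrated with a toy conditional affine coupling layer. This is not the actual flow architecture used in OpenVoice [12]; it is a minimal sketch, under assumed channel sizes, of how one invertible module can strip tone color in the forward direction (given v(CI)) and inject a new one in the inverse direction (given v(CO)).

    import torch
    import torch.nn as nn

    class CouplingFlow(nn.Module):
        """Toy conditional affine coupling layer (illustrative only)."""
        def __init__(self, channels: int = 192, spk_dim: int = 256):
            super().__init__()
            self.half = channels // 2
            # Predicts a shift and log-scale for the second half of the
            # channels, conditioned on the first half and a speaker vector.
            self.net = nn.Conv1d(self.half + spk_dim, channels, kernel_size=1)

        def _params(self, a, v):
            # Broadcast the speaker vector over time and concatenate.
            cond = torch.cat(
                [a, v[:, :, None].expand(-1, -1, a.size(-1))], dim=1)
            shift, log_scale = self.net(cond).chunk(2, dim=1)
            return shift, torch.tanh(log_scale)

        def forward(self, y, v):
            # Y(LI,SI,CI) + v(CI) -> Z(LI,SI): normalize tone color away.
            a, b = y.chunk(2, dim=1)
            shift, log_scale = self._params(a, v)
            return torch.cat([a, (b - shift) * torch.exp(-log_scale)], dim=1)

        def inverse(self, z, v):
            # Z(LI,SI) + v(CO) -> Y(LI,SI,CO): inject the reference tone color.
            a, b = z.chunk(2, dim=1)
            shift, log_scale = self._params(a, v)
            return torch.cat([a, b * torch.exp(log_scale) + shift], dim=1)

    # Usage: z = flow(y_ci, v_ci) removes the base speaker's tone color;
    # y_co = flow.inverse(z, v_co) writes in the reference tone color; a
    # HiFi-GAN-style stack of transposed 1D convolutions then decodes y_co.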
Please explain the above passage.
