Introduction
Voice conversion (VC) aims to convert the non-linguistic information of a speech signal while keeping the linguistic content unchanged. It is helpful for:
- multi-speaker text-to-speech
- speech synthesis
- speech enhancement[1]
- data augmentation[2]
- pronunciation correction
- accent removal[3]
- combining the voice quality of a non-native speaker with the pronunciation patterns of a native speaker, which can be used in language learning
- speaking style conversion
- improving the speech intelligibility[7][8]
- restoring speech for surgical patients who have had parts of their articulators removed
- improving the intelligibility and naturalness of a deaf speaker's hearing-impaired speech
- singing voice conversion[9][10][11][12]
Categories
- Supervised VC with parallel data
- Unsupervised VC without parallel data
- Feature Disentanglement: translate the speech into phoneme posterior sequences with an ASR system, then synthesize speech with the target-domain synthesizer
- Direct Transformation: use deep generative models (GAN/VAE)
The encoder $E$ encodes the source speech $x$ into a latent representation $z$. The latent representation $z$ contains only the linguistic content, so the decoder $D$ can recover the source given the speaker identity, while classifier-1 cannot distinguish the speakers from $z$[13].
The encoder-decoder architecture. | The classifier guides the model to learn to disentangle features. |
---|---|
So feature disentanglement here means separating the linguistic (phonetic) content from the speaker identity. The classifier or discriminator adversarially guides the model to learn disentangled features. But how can the linguistic content be maintained? One way is to use a content encoder pre-trained on an ASR task[14][15][16][17]. For the speaker identity, one can use a one-hot vector per speaker, or a speaker embedding (i-vector, d-vector, x-vector, etc.).
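A minimal PyTorch-style sketch of this training setup, assuming a mel-spectrogram input and illustrative module names and sizes (this is not the exact model of [13]): a content encoder, a decoder conditioned on a speaker embedding, and an adversarial speaker classifier on the latent code.

```python
# Sketch: adversarial feature disentanglement (hypothetical sizes/names).
import torch
import torch.nn as nn

DIM, N_SPEAKERS, N_MELS = 128, 100, 80

content_encoder = nn.GRU(N_MELS, DIM, batch_first=True)
speaker_embed   = nn.Embedding(N_SPEAKERS, DIM)
decoder         = nn.GRU(2 * DIM, N_MELS, batch_first=True)
classifier      = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(),
                                nn.Linear(DIM, N_SPEAKERS))

def step(mel, spk_id):
    """mel: (B, T, N_MELS), spk_id: (B,) long tensor of speaker indices."""
    z, _ = content_encoder(mel)                       # (B, T, DIM) content code
    spk = speaker_embed(spk_id)                       # (B, DIM) speaker code
    spk_t = spk.unsqueeze(1).expand(-1, z.size(1), -1)
    recon, _ = decoder(torch.cat([z, spk_t], dim=-1))

    # Reconstruction loss keeps the linguistic content.
    rec_loss = nn.functional.l1_loss(recon, mel)

    # Adversarial part: the classifier tries to identify the speaker from z,
    # while the encoder is trained to fool it (maximize the classifier loss).
    cls_logits = classifier(z.mean(dim=1))
    cls_loss = nn.functional.cross_entropy(cls_logits, spk_id)
    return rec_loss, cls_loss   # encoder/decoder minimize rec_loss - cls_loss;
                                # classifier minimizes cls_loss (alternating updates)
```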
Feature Disentanglement
ASR pre-trained content encoder | Instance normalization |
---|---|
PPG-based methods
A phonetic posteriorgram (PPG) is obtained from a speaker-independent automatic speech recognition (SI-ASR) system[15][18]. PPGs represent the articulation of speech sounds in a speaker-normalized space and correspond to the spoken content in a speaker-independent manner.
A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance. A phonetic class may refer to a word, a phone or a senone.
As illustrated in the figure above, the approach is divided into three stages: training stage 1, training stage 2, and the conversion stage. The role of the SI-ASR model (trained in stage 1) is to obtain a PPG representation of the input speech. Training stage 2 models the relationship between the PPGs and the Mel-cepstral coefficient (MCEP) features of the target speaker for speech parameter generation. The conversion stage drives the trained DBLSTM model with the PPGs of the source speech (obtained from the same SI-ASR) for VC.
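A pseudocode-style sketch of the three stages; every helper here (`train_asr`, `extract_ppg`, `extract_mcep`, `DBLSTM`, `mcep_to_waveform`) is a hypothetical placeholder, not an existing API, and only the data flow follows the description above.

```python
# Sketch of the three-stage PPG-based VC pipeline (placeholder helpers).

def train_stage_1(multi_speaker_corpus):
    """Train a speaker-independent ASR model that outputs frame-level PPGs."""
    si_asr = train_asr(multi_speaker_corpus)          # placeholder trainer
    return si_asr

def train_stage_2(si_asr, target_speaker_utts):
    """Learn the mapping PPG -> target-speaker MCEPs with a DBLSTM."""
    pairs = [(si_asr.extract_ppg(wav), extract_mcep(wav))
             for wav in target_speaker_utts]
    dblstm = DBLSTM()
    dblstm.fit(pairs)                                 # frame-aligned regression
    return dblstm

def convert(si_asr, dblstm, source_wav):
    """Conversion stage: source speech -> PPG -> target MCEPs -> waveform."""
    ppg = si_asr.extract_ppg(source_wav)              # speaker-independent content
    mcep = dblstm.predict(ppg)                        # target speaker's spectra
    return mcep_to_waveform(mcep)                     # vocoder
```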
In [3], the authors propose a PPG-based system for foreign accent conversion (FAC). They use an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then train a speech synthesizer to map PPGs from the non-native speaker into the corresponding spectral features, which in turn are converted into the audio waveform by a high-quality neural vocoder. At runtime, the synthesizer is driven with the PPGs extracted from a native reference utterance.
Overall workflow of the proposed PPG system | PPG-to-Mel conversion model |
---|---|
MSVC model with i-vector
In [17], two different systems are proposed to achieve any-to-any VC:
- The i-vector-based VC (IVC) system
- The speaker-encoder-based VC (SEVC) system
Both systems train a deep bidirectional long short-term memory (DBLSTM) based multi-speaker voice conversion (MSVC) model. The IVC system uses i-vectors to encode speaker identities, while the SEVC system uses learnable speaker embeddings.
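A minimal sketch of this kind of speaker conditioning, with assumed feature dimensions and module choices (not the exact architecture of [17]): the speaker vector (i-vector or learned embedding) is concatenated to every input frame of the BLSTM.

```python
# Sketch: MSVC-style conditioning on a speaker vector (illustrative sizes).
import torch
import torch.nn as nn

PPG_DIM, IVEC_DIM, N_MELS = 131, 100, 80

blstm = nn.LSTM(PPG_DIM + IVEC_DIM, 256, num_layers=2,
                batch_first=True, bidirectional=True)
proj = nn.Linear(2 * 256, N_MELS)         # map BLSTM states to target spectra

def msvc_forward(ppg, speaker_vec):
    """ppg: (B, T, PPG_DIM); speaker_vec: (B, IVEC_DIM) i-vector or embedding."""
    spk = speaker_vec.unsqueeze(1).expand(-1, ppg.size(1), -1)  # (B, T, IVEC_DIM)
    x = torch.cat([ppg, spk], dim=-1)      # frame-wise speaker conditioning
    h, _ = blstm(x)
    return proj(h)                         # predicted target-speaker features
```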
The key observations from the results are as follows:
- Both the proposed IVC and SEVC systems can achieve VC across a new source-target speaker pair using only one target-speaker utterance. The converted speech has desirable quality and similarity.
- The IVC system is superior to the SEVC system in terms of the converted speech’s quality and similarity.
PitchNet
PitchNet[10] is a singing voice conversion method. It adds an adversarially trained pitch regression network to force the encoder to learn a pitch-invariant phoneme representation, and a separate module that feeds the pitch extracted from the source audio to the decoder network.
PitchNet consists of five parts: an encoder, a decoder, a look-up table (LUT) of singer embedding vectors, a singer classification network, and a pitch regression network. The audio waveform is fed directly into the encoder. The output of the encoder, the singer embedding vector retrieved from the LUT, and the input pitch are concatenated to condition the WaveNet[19] decoder, which outputs the audio waveform.
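A minimal sketch of this conditioning scheme, with assumed shapes and plain convolutions standing in for the real WaveNet encoder/decoder; the adversarial singer-classification and pitch-regression networks that act on the encoder output during training are omitted here.

```python
# Sketch: PitchNet-style conditioning (hypothetical sizes, simplified modules).
import torch
import torch.nn as nn

N_SINGERS, EMB_DIM, CONTENT_DIM = 12, 64, 64

encoder    = nn.Conv1d(1, CONTENT_DIM, kernel_size=3, padding=1)
singer_lut = nn.Embedding(N_SINGERS, EMB_DIM)   # LUT of singer embedding vectors
decoder    = nn.Conv1d(CONTENT_DIM + EMB_DIM + 1, 1, kernel_size=3, padding=1)

def convert(wave, target_singer_id, source_pitch):
    """wave: (B, 1, T); target_singer_id: (B,); source_pitch: (B, 1, T)."""
    content = encoder(wave)                                # (B, CONTENT_DIM, T)
    singer = singer_lut(target_singer_id)                  # (B, EMB_DIM)
    singer = singer.unsqueeze(-1).expand(-1, -1, content.size(-1))
    # Concatenate content code, singer embedding and pitch, then condition
    # the decoder on the result to generate the converted waveform.
    cond = torch.cat([content, singer, source_pitch], dim=1)
    return decoder(cond)
```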
The overall architecture of PitchNet | The architecture of the singer classification network | The architecture of the pitch regression network |
---|---|---|
Instance normalization (IN) for feature disentanglement is applied in [20]. It shows that simply adding instance normalization without affine transformation to the content encoder can remove the speaker information while preserving the content information. To further force the speaker encoder to generate a useful speaker representation, the speaker information is provided to the decoder through an adaptive instance normalization (AdaIN) layer. The idea comes from style transfer in computer vision[21].
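A small sketch of IN without affine parameters and of AdaIN, in the spirit of [20][21]; the tensor shapes are illustrative.

```python
# Sketch: IN (no affine) on content features and AdaIN with speaker statistics.
import torch

def instance_norm(x, eps=1e-5):
    """x: (B, C, T). Normalize each channel over time with no learned affine,
    which strips per-utterance (speaker/global) statistics."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std

def adain(content, speaker_mean, speaker_std):
    """Adaptive IN: re-normalize the content features, then scale and shift
    them with statistics predicted from the speaker encoder."""
    normalized = instance_norm(content)
    return speaker_std * normalized + speaker_mean   # broadcast over time

# Usage sketch:
content = torch.randn(4, 64, 100)             # content-encoder output (B, C, T)
spk_mean = torch.randn(4, 64, 1)              # predicted per-channel shift
spk_std = torch.rand(4, 64, 1) + 0.5          # predicted per-channel scale
styled = adain(content, spk_mean, spk_std)    # decoder input carrying speaker info
```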
The idea of IN is also applied in [4], where the speech signal is decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion.
Direct Transformation
A straightforward method is to use the cycle-consistency loss, as in CycleGAN-based models[22][23][1] and StarGAN-VC[24][25].
CycleGAN | StarGAN |
---|---|
CycleGAN
Adversarial loss
Makes a converted feature $G_{X \to Y}(x)$ indistinguishable from a target feature $y$, i.e., it restricts $G_{X \to Y}(x)$ to follow the target data distribution.
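Spelled out in the usual CycleGAN-VC notation (source domain $X$, target domain $Y$, generator $G_{X \to Y}$, discriminator $D_Y$), the adversarial loss takes the standard GAN form:

$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P_Y}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim P_X}\big[\log\big(1 - D_Y(G_{X \to Y}(x))\big)\big]$$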
Cycle-consistency loss
Guarantees the linguistic consistency between input and output features. The adversarial loss alone does not necessarily keep the contextual information of $x$ and $G_{X \to Y}(x)$ consistent, because it only tells us whether $G_{X \to Y}(x)$ follows the target data distribution and does not help preserve the contextual information of $x$. Since there are infinitely many mappings that induce the same output distribution, the adversarial loss cannot guarantee linguistic consistency.
The cycle-consistency loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to form a bijection and to find pairs with the same contextual information: the forward-inverse mapping should satisfy $G_{Y \to X}(G_{X \to Y}(x)) \approx x$, and the inverse-forward mapping $G_{X \to Y}(G_{Y \to X}(y)) \approx y$.
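In the same notation, the cycle-consistency loss is usually written as:

$$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P_X}\big[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim P_Y}\big[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1\big]$$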
Identity-mapping loss
To further encourage input preservation, the identity-mapping loss penalizes the generator for changing a feature that already belongs to its target domain. CycleGAN-VC2[23] additionally applies a second adversarial loss to the circularly converted feature (a two-step adversarial loss, on top of the one-step adversarial loss above) to mitigate the over-smoothing caused by the cycle-consistency loss.
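For completeness, the identity-mapping loss in its standard form, again in the CycleGAN-VC notation used above:

$$\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P_Y}\big[\lVert G_{X \to Y}(y) - y \rVert_1\big] + \mathbb{E}_{x \sim P_X}\big[\lVert G_{Y \to X}(x) - x \rVert_1\big]$$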
In [5], the Wasserstein distance metric (WGAN loss) with gradient penalty is considered.
StarGAN
For each conversion pair $(X, Y)$, CycleGAN needs its own generators and discriminators, so for 100 speakers a large group of generators and discriminators would be required. StarGAN-VC[24][25] is introduced to solve this problem: all speakers share the same generator and discriminator. The aim of StarGAN-VC is to obtain a single generator $G$ that learns mappings among multiple domains. To achieve this, StarGAN-VC extends CycleGAN-VC to a conditional setting with a domain code $c$ (e.g., a speaker identifier). More precisely, StarGAN-VC learns a generator $G$ that converts an input acoustic feature $x$ into an output feature $\hat{x}$ conditioned on the target domain code $c$, i.e., $\hat{x} = G(x, c)$.
In [24], the CycleGAN and StarGAN training schemes are compared.
Concept of CycleGAN training. | Concept of StarGAN training. |
---|---|
Adversarial losses: the generators $G_{X \to Y}$ and $G_{Y \to X}$ try to fool the discriminators $D_Y$ and $D_X$, while $D_Y$ and $D_X$ try to avoid being fooled. | Adversarial loss: the generator $G$ takes an acoustic feature $x$ and a target attribute label $c$ as input; $G$ tries to fool the discriminator $D$, while $D$ tries to avoid being fooled. |
Cycle-consistency loss: guarantees that $G_{Y \to X}(G_{X \to Y}(x))$ and $G_{X \to Y}(G_{Y \to X}(y))$ preserve the linguistic information of the input speech, i.e., it encourages $G_{Y \to X}(G_{X \to Y}(x)) \approx x$ and $G_{X \to Y}(G_{Y \to X}(y)) \approx y$. | Domain classification loss: defined for the classifier $C$ and the generator $G$; it encourages $C$ to correctly classify real data and $G(x, c)$ as belonging to attribute $c$. |
  | Cycle-consistency loss: encourages $G$ to be a bijection and guarantees that $G(x, c)$ preserves the linguistic information of the input speech. |
  | Identity-mapping loss: ensures that an input to $G$ remains unchanged when it already belongs to the target attribute $c$. |
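Roughly, the StarGAN-VC generator objective combines these terms as a weighted sum (the weights $\lambda_{cls}$, $\lambda_{cyc}$, $\lambda_{id}$ are hyperparameters; see [24] for the exact definition of each term):

$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{id}\,\mathcal{L}_{id}$$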
In [25], a source-and-target conditional adversarial loss is developed.
- (a) In the classification loss, the generator prefers to generate classifiable (i.e., far from the decision boundary) data.
- (b) In the target conditional adversarial loss, the generator needs to simultaneously handle hard negative samples (e.g., conversion within the same speaker) and easy negative samples (e.g., conversion between completely different speakers).
- (c) The proposed source-and-target conditional adversarial loss brings all the converted data close to the target data in both a source-wise and a target-wise manner. This resolves the unfair training condition of the target conditional adversarial loss in (b) and allows all the source-domain data to be converted into target-domain data.
Here, the extra domain code is randomly sampled independently of the real data. Both the generator $G$ and the discriminator $D$ are conditioned on the source code $c'$ in addition to the target code $c$.
The source-and-target conditional generator requires the source code to be available at inference time, which is not required in the conventional StarGAN-VC.
For the objective evaluation, Mel-cepstral distortion (MCD) and modulation spectra distance (MSD) are used.
For the subjective evaluation, a mean opinion score (MOS) test (5: excellent to 1: bad) is conducted.
VAE
The conversion function is reformulated as an autoencoder. The encoder $E$ is designed to be speaker-independent and converts an observed frame $x$ into a speaker-independent latent variable (code) $z = E(x)$. Presumably, $z$ contains information that is irrelevant to the speaker, such as phonetic variation, and is therefore also referred to as the phonetic representation[26].
The decoder $D$ takes a speaker representation $s$ as another latent variable and, together with $z$, reconstructs a speaker-dependent frame $\hat{x} = D(z, s)$.
The VAE framework constrains the latent variable $z$ to be (approximately) Gaussian, i.e., speaker-normalized or speaker-independent, so $z$ can be regarded as carrying the linguistic information.
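As a sketch (using the generic conditional-VAE objective rather than the exact formulation of [26]), training maximizes, for each frame $x$ with speaker representation $s$, the evidence lower bound

$$\mathcal{L}(\theta, \phi; x, s) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z, s)\big] - D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big), \qquad p(z) = \mathcal{N}(0, I)$$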
In [27], CycleVAE-based VC is proposed. CycleVAE is capable of recycling the converted spectra back into the system, so that the conversion flow is indirectly considered in the parameter optimization.
Conditional VAEs (CVAEs) are an extended version of VAEs, with the only difference being that the encoder and decoder networks can take an auxiliary variable $c$ (e.g., an attribute class label) as an additional input[28].
Regular CVAEs impose no restrictions on the manner in which the encoder and decoder may use the attribute class label $c$. Hence, the encoder and decoder are free to ignore $c$ by finding distributions satisfying $q(z \mid x, c) = q(z \mid x)$ and $p(x \mid z, c) = p(x \mid z)$. This can occur, for instance, when the encoder and decoder have sufficient capacity to reconstruct any data without using $c$. To avoid such situations, ACVAE[29] introduces an information-theoretic regularization[30] from InfoGAN that encourages the decoder output to be correlated as strongly as possible with $c$. ACVAE stands for VAE with an auxiliary classifier (AC), as shown below.
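A sketch of the regularizer in the InfoGAN style of [30] (not necessarily the exact expression used in [29]): the mutual information between $c$ and the decoder output $\hat{x}$ is lower-bounded with an auxiliary classifier $r(c \mid \hat{x})$, and this bound is added to the CVAE training objective:

$$I(c; \hat{x}) \;\ge\; \mathbb{E}_{c, \hat{x}}\big[\log r(c \mid \hat{x})\big] + H(c)$$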
In [31], a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) is proposed. The model explicitly considers a VC objective when building the speech model. The speech model is built with a CVAE and improved with a WGAN by modifying the loss function.
References
- Mimura, M., Sakai, S., & Kawahara, T. (2017). Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks. 134–140.
- Keskin, G., Lee, T., Stephenson, C., & O. H. E. (2019). Measuring the effectiveness of voice conversion on speaker identification and automatic speech recognition systems.
- Zhao, G., Ding, S., & Gutierrez-Osuna, R. (n.d.). Foreign accent conversion by synthesizing speech from phonetic posteriorgrams. 2–6.
- Gao, J., Chakraborty, D., Tembine, H., & Olaleye, O. (2019). Nonparallel emotional speech conversion. 2858–2862.
- Seshadri, S., Juvela, L., Yamagishi, J., Rasanen, O., & Alku, P. (n.d.). Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion.
- Patel, M., Parmar, M., Doshi, S., Shah, N., & Patil, H. A. (2019). Novel Inception-GAN for whisper-to-normal speech conversion. 7–9.
- Chen, L., Lee, H., & Tsao, Y. (n.d.). Generative adversarial networks for unpaired voice transformation on impaired speech. 2–6.
- Biadsy, F., Weiss, R. J., Moreno, P. J., Kanevsky, D., & Jia, Y. (n.d.). Parrotron: An end-to-end speech-to-speech conversion model and its applications.
- Polyak, A., Wolf, L., Adi, Y., & Taigman, Y. (2020). Unsupervised cross-domain singing voice conversion. http://arxiv.org/abs/2008.02830
- Deng, C., Yu, C., Lu, H., Weng, C., & D. Y. (n.d.). PitchNet: Unsupervised singing voice conversion with pitch adversarial network. 2–6.
- Luo, Y.-J., Hsu, C.-C., Agres, K., & D. H. (2020). Singing voice conversion with disentangled representations of singer and vocal technique using variational autoencoders. ICASSP, 2–6.
- Polyak, A., Wolf, L., Adi, Y., & Taigman, Y. (2020). Unsupervised cross-domain singing voice conversion. http://arxiv.org/abs/2008.02830
- Chou, J., Yeh, C., Lee, H., & Lee, L. (2018). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations.
- Liu, A. T., Hsu, P. C., & Lee, H. Y. (2019). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, 1108–1112. https://doi.org/10.21437/Interspeech.2019-2048
- Sun, L., Li, K., Wang, H., Kang, S., & Meng, H. (2016). Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. IEEE International Conference on Multimedia and Expo (ICME).
- Qian, K., Zhang, Y., Chang, S., Yang, X., & Hasegawa-Johnson, M. (2019). AutoVC: Zero-shot voice style transfer with only autoencoder loss. http://arxiv.org/abs/1905.05879
- Liu, S., Zhong, J., Sun, L., Wu, X., Liu, X., & Meng, H. (2018). Voice conversion across arbitrary speakers based on a single target-speaker utterance. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, 496–500. https://doi.org/10.21437/Interspeech.2018-1504
- Sun, L., Wang, H., Kang, S., Li, K., & Meng, H. (2016). Personalized, cross-lingual TTS using phonetic posteriorgrams. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2016, 322–326. https://doi.org/10.21437/Interspeech.2016-1043
- Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. 1–15. http://arxiv.org/abs/1609.03499
- Chou, J., Yeh, C., & Lee, H. (2019). One-shot voice conversion by separating speaker and content representations with instance normalization.
- Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. ICCV 2017. http://openaccess.thecvf.com/content_ICCV_2017/papers/Huang_Arbitrary_Style_Transfer_ICCV_2017_paper.pdf
- Kaneko, T., & Kameoka, H. (2018). CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. European Signal Processing Conference (EUSIPCO) 2018, 2100–2104. https://doi.org/10.23919/EUSIPCO.2018.8553236
- Kaneko, T., Kameoka, H., Tanaka, K., & N. H. (n.d.). CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion.
- Kameoka, H., Kaneko, T., Tanaka, K., & Hojo, N. (2019). StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks. 2018 IEEE Spoken Language Technology Workshop (SLT 2018), 266–273. https://doi.org/10.1109/SLT.2018.8639535
- Kaneko, T., Kameoka, H., Tanaka, K., & Hojo, N. (n.d.). StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion.
- Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., & Wang, H. M. (2017). Voice conversion from non-parallel corpora using variational auto-encoder. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA 2016). https://doi.org/10.1109/APSIPA.2016.7820786
- Tobing, P. L., Wu, Y. C., Hayashi, T., Kobayashi, K., & Toda, T. (2019). Non-parallel voice conversion with cyclic variational autoencoder. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, 674–678. https://doi.org/10.21437/Interspeech.2019-2307
- Kingma, D. P., Rezende, D. J., Mohamed, S., & Welling, M. (2014). Semi-supervised learning with deep generative models. 1–9.
- Kameoka, H., Kaneko, T., Tanaka, K., & Hojo, N. (2019). ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(9), 1432–1443. https://doi.org/10.1109/TASLP.2019.2917232
- Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 2180–2188.
- Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., & Wang, H. M. (2017). Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, 3364–3368. https://doi.org/10.21437/Interspeech.2017-63