Introduction
Voice conversion (VC) aims to convert the non-linguistic information of a speech signal while keeping the linguistic content unchanged. It is helpful for:
- multi-speaker text-to-speech
- speech synthesis
- speech enhancement[1]
- data augmentation[2]
- pronunciation correction
- accent removal[3]
- combining the voice quality of a non-native speaker with the pronunciation patterns of a native speaker, which can be used in language learning
- speaking style conversion
- improving the speech intelligibility[7][8]
- restoring speech for surgical patients who have had parts of their articulators removed
- improving the intelligibility and naturalness of a deaf speaker's hearing-impaired speech
- singing voice conversion[9][10][11][12]
Categories
- Supervised VC with parallel data
- Unsupervised VC without parallel data
- Feature Disentanglement: translate the speech into phoneme posterior sequences with an ASR system, then synthesize speech with the target-domain synthesizer
- Direct Transformation: use deep generative models (GAN/VAE)
The encoder $E$ encodes the source speech $x$ into a latent representation $z$. The latent representation $z$ contains only the linguistic content, so the decoder $D$ can recover the source given the speaker identity, while classifier-1 cannot distinguish the speakers from $z$[13].
The encoder-decoder architecture. | The classifier guides the model to learn to disentangle features. |
---|---|
So feature disentanglement here means separating the linguistic (phonetic) content from the speaker identity. The classifier or discriminator adversarially guides the model to learn disentangled features. But how can the linguistic content be maintained? One way is to use a content encoder pre-trained on an ASR task[14][15][16][17]. For the speaker identity, one can use a one-hot vector per speaker, or a speaker embedding (i-vector, d-vector, x-vector, etc.).
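A minimal PyTorch-style sketch of this training setup, assuming a mel-spectrogram input and illustrative module names and sizes (this is not the exact model of [13]): a content encoder, a decoder conditioned on a speaker embedding, and an adversarial speaker classifier on the latent code.

```python
# Sketch: adversarial feature disentanglement (hypothetical sizes/names).
import torch
import torch.nn as nn

DIM, N_SPEAKERS, N_MELS = 128, 100, 80

content_encoder = nn.GRU(N_MELS, DIM, batch_first=True)
speaker_embed   = nn.Embedding(N_SPEAKERS, DIM)
decoder         = nn.GRU(2 * DIM, N_MELS, batch_first=True)
classifier      = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(),
                                nn.Linear(DIM, N_SPEAKERS))

def step(mel, spk_id):
    """mel: (B, T, N_MELS), spk_id: (B,) long tensor of speaker indices."""
    z, _ = content_encoder(mel)                       # (B, T, DIM) content code
    spk = speaker_embed(spk_id)                       # (B, DIM) speaker code
    spk_t = spk.unsqueeze(1).expand(-1, z.size(1), -1)
    recon, _ = decoder(torch.cat([z, spk_t], dim=-1))

    # Reconstruction loss keeps the linguistic content.
    rec_loss = nn.functional.l1_loss(recon, mel)

    # Adversarial part: the classifier tries to identify the speaker from z,
    # while the encoder is trained to fool it (maximize the classifier loss).
    cls_logits = classifier(z.mean(dim=1))
    cls_loss = nn.functional.cross_entropy(cls_logits, spk_id)
    return rec_loss, cls_loss   # encoder/decoder minimize rec_loss - cls_loss;
                                # classifier minimizes cls_loss (alternating updates)
```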
Feature Disentanglement
ASR pre-trained content encoder | Instance normalization |
---|---|
PPG-based methods
A phonetic posteriorgram (PPG) is obtained from a speaker-independent automatic speech recognition (SI-ASR) system[15][18]. PPGs represent the articulation of speech sounds in a speaker-normalized space and correspond to the spoken content in a speaker-independent manner.
A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance. A phonetic class may refer to a word, a phone or a senone.
As illustrated in the figure above, the approach is divided into three stages: training stage 1, training stage 2, and the conversion stage. The role of the SI-ASR model (trained in stage 1) is to obtain a PPG representation of the input speech. Training stage 2 models the relationship between the PPGs and the Mel-cepstral coefficient (MCEP) features of the target speaker for speech parameter generation. The conversion stage drives the trained DBLSTM model with the PPGs of the source speech (obtained from the same SI-ASR) for VC.
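A pseudocode-style sketch of the three stages; every helper here (`train_asr`, `extract_ppg`, `extract_mcep`, `DBLSTM`, `mcep_to_waveform`) is a hypothetical placeholder, not an existing API, and only the data flow follows the description above.

```python
# Sketch of the three-stage PPG-based VC pipeline (placeholder helpers).

def train_stage_1(multi_speaker_corpus):
    """Train a speaker-independent ASR model that outputs frame-level PPGs."""
    si_asr = train_asr(multi_speaker_corpus)          # placeholder trainer
    return si_asr

def train_stage_2(si_asr, target_speaker_utts):
    """Learn the mapping PPG -> target-speaker MCEPs with a DBLSTM."""
    pairs = [(si_asr.extract_ppg(wav), extract_mcep(wav))
             for wav in target_speaker_utts]
    dblstm = DBLSTM()
    dblstm.fit(pairs)                                 # frame-aligned regression
    return dblstm

def convert(si_asr, dblstm, source_wav):
    """Conversion stage: source speech -> PPG -> target MCEPs -> waveform."""
    ppg = si_asr.extract_ppg(source_wav)              # speaker-independent content
    mcep = dblstm.predict(ppg)                        # target speaker's spectra
    return mcep_to_waveform(mcep)                     # vocoder
```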
In [3], the authors propose a PPG-based system for foreign accent conversion (FAC). They use an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then train a speech synthesizer to map PPGs from the non-native speaker into the corresponding spectral features, which in turn are converted into the audio waveform by a high-quality neural vocoder. At runtime, the synthesizer is driven with the PPGs extracted from a native reference utterance.
Overall workflow of the proposed PPG system | PPG-to-Mel conversion model |
---|---|
MSVC model with i-vector
In [17], two different systems are proposed to achieve any-to-any VC:
- The i-vector-based VC (IVC) system
- The speaker-encoder-based VC (SEVC) system
Both systems train a deep bidirectional long short-term memory (DBLSTM) based multi-speaker voice conversion (MSVC) model. The IVC system uses i-vectors to encode speaker identities, while the SEVC system uses learnable speaker embeddings.
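A minimal sketch of this kind of speaker conditioning, with assumed feature dimensions and module choices (not the exact architecture of [17]): the speaker vector (i-vector or learned embedding) is concatenated to every input frame of the BLSTM.

```python
# Sketch: MSVC-style conditioning on a speaker vector (illustrative sizes).
import torch
import torch.nn as nn

PPG_DIM, IVEC_DIM, N_MELS = 131, 100, 80

blstm = nn.LSTM(PPG_DIM + IVEC_DIM, 256, num_layers=2,
                batch_first=True, bidirectional=True)
proj = nn.Linear(2 * 256, N_MELS)         # map BLSTM states to target spectra

def msvc_forward(ppg, speaker_vec):
    """ppg: (B, T, PPG_DIM); speaker_vec: (B, IVEC_DIM) i-vector or embedding."""
    spk = speaker_vec.unsqueeze(1).expand(-1, ppg.size(1), -1)  # (B, T, IVEC_DIM)
    x = torch.cat([ppg, spk], dim=-1)      # frame-wise speaker conditioning
    h, _ = blstm(x)
    return proj(h)                         # predicted target-speaker features
```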
The key observations from the results are as follows:
- Both the proposed IVC and SEVC systems can achieve VC across a new source-target speaker pair using only one target-speaker utterance. The converted speech has desirable quality and similarity.
- The IVC system is superior to the SEVC system in terms of the converted speech’s quality and similarity.
PitchNet
PitchNet[10] is a singing voice conversion method. It adds an adversarially trained pitch regression network to force the encoder to learn a pitch-invariant phoneme representation, and a separate module that feeds the pitch extracted from the source audio to the decoder network.
PitchNet consists of five parts: an encoder, a decoder, a look-up table (LUT) of singer embedding vectors, a singer classification network, and a pitch regression network. The audio waveform is fed directly into the encoder. The output of the encoder, the singer embedding vector retrieved from the LUT, and the input pitch are concatenated to condition the WaveNet[19] decoder, which outputs the audio waveform.
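A minimal sketch of this conditioning scheme, with assumed shapes and plain convolutions standing in for the real WaveNet encoder/decoder; the adversarial singer-classification and pitch-regression networks that act on the encoder output during training are omitted here.

```python
# Sketch: PitchNet-style conditioning (hypothetical sizes, simplified modules).
import torch
import torch.nn as nn

N_SINGERS, EMB_DIM, CONTENT_DIM = 12, 64, 64

encoder    = nn.Conv1d(1, CONTENT_DIM, kernel_size=3, padding=1)
singer_lut = nn.Embedding(N_SINGERS, EMB_DIM)   # LUT of singer embedding vectors
decoder    = nn.Conv1d(CONTENT_DIM + EMB_DIM + 1, 1, kernel_size=3, padding=1)

def convert(wave, target_singer_id, source_pitch):
    """wave: (B, 1, T); target_singer_id: (B,); source_pitch: (B, 1, T)."""
    content = encoder(wave)                                # (B, CONTENT_DIM, T)
    singer = singer_lut(target_singer_id)                  # (B, EMB_DIM)
    singer = singer.unsqueeze(-1).expand(-1, -1, content.size(-1))
    # Concatenate content code, singer embedding and pitch, then condition
    # the decoder on the result to generate the converted waveform.
    cond = torch.cat([content, singer, source_pitch], dim=1)
    return decoder(cond)
```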
The overall architecture of PitchNet | The architecture of the singer classification network | The architecture of the pitch regression network |
---|---|---|
Instance normalization (IN) for feature disentanglement is applied in [20]. It shows that simply adding instance normalization without affine transformation to the content encoder can remove the speaker information while preserving the content information. To further force the speaker encoder to generate a useful speaker representation, the speaker information is provided to the decoder through an adaptive instance normalization (AdaIN) layer. The idea comes from style transfer in computer vision[21].
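A small sketch of IN without affine parameters and of AdaIN, in the spirit of [20][21]; the tensor shapes are illustrative.

```python
# Sketch: IN (no affine) on content features and AdaIN with speaker statistics.
import torch

def instance_norm(x, eps=1e-5):
    """x: (B, C, T). Normalize each channel over time with no learned affine,
    which strips per-utterance (speaker/global) statistics."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std

def adain(content, speaker_mean, speaker_std):
    """Adaptive IN: re-normalize the content features, then scale and shift
    them with statistics predicted from the speaker encoder."""
    normalized = instance_norm(content)
    return speaker_std * normalized + speaker_mean   # broadcast over time

# Usage sketch:
content = torch.randn(4, 64, 100)             # content-encoder output (B, C, T)
spk_mean = torch.randn(4, 64, 1)              # predicted per-channel shift
spk_std = torch.rand(4, 64, 1) + 0.5          # predicted per-channel scale
styled = adain(content, spk_mean, spk_std)    # decoder input carrying speaker info
```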
The idea of IN is also applied in [4], where the speech signal is decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion.
Direct Transformation
A straightforward method is to use the cycle-consistency loss, as in CycleGAN-based models[22][23][1] and StarGAN-VC[24][25].
CycleGAN | StarGAN |
---|---|
CycleGAN
Adversarial loss
Makes a converted feature $G_{X \to Y}(x)$ indistinguishable from a target feature $y$, i.e., it restricts $G_{X \to Y}(x)$ to follow the target data distribution.
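Spelled out in the usual CycleGAN-VC notation (source domain $X$, target domain $Y$, generator $G_{X \to Y}$, discriminator $D_Y$), the adversarial loss takes the standard GAN form:

$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P_Y}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim P_X}\big[\log\big(1 - D_Y(G_{X \to Y}(x))\big)\big]$$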
Cycle-consistency loss
Guarantees the linguistic consistency between input and output features. The adversarial loss alone does not necessarily keep the contextual information of $x$ and $G_{X \to Y}(x)$ consistent, because it only tells us whether $G_{X \to Y}(x)$ follows the target data distribution and does not help preserve the contextual information of $x$. Since there are infinitely many mappings that induce the same output distribution, the adversarial loss cannot guarantee linguistic consistency.
The cycle-consistency loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to form a bijection and to find pairs with the same contextual information: the forward-inverse mapping should satisfy $G_{Y \to X}(G_{X \to Y}(x)) \approx x$, and the inverse-forward mapping $G_{X \to Y}(G_{Y \to X}(y)) \approx y$.
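In the same notation, the cycle-consistency loss is usually written as:

$$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P_X}\big[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim P_Y}\big[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1\big]$$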
Identity-mapping loss
To further encourage input preservation, the identity-mapping loss penalizes the generator for changing a feature that already belongs to its target domain. CycleGAN-VC2[23] additionally applies a second adversarial loss to the circularly converted feature (a two-step adversarial loss, on top of the one-step adversarial loss above) to mitigate the over-smoothing caused by the cycle-consistency loss.
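For completeness, the identity-mapping loss in its standard form, again in the CycleGAN-VC notation used above:

$$\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P_Y}\big[\lVert G_{X \to Y}(y) - y \rVert_1\big] + \mathbb{E}_{x \sim P_X}\big[\lVert G_{Y \to X}(x) - x \rVert_1\big]$$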
In [5], the Wasserstein distance metric (WGAN loss) with gradient penalty is considered.
StarGAN
For each conversion pair $(X, Y)$, CycleGAN needs its own generators and discriminators, so for 100 speakers a large group of generators and discriminators would be required. StarGAN-VC[24][25] is introduced to solve this problem: all speakers share the same generator and discriminator. The aim of StarGAN-VC is to obtain a single generator $G$ that learns mappings among multiple domains. To achieve this, StarGAN-VC extends CycleGAN-VC to a conditional setting with a domain code $c$ (e.g., a speaker identifier). More precisely, StarGAN-VC learns a generator $G$ that converts an input acoustic feature $x$ into an output feature $\hat{x}$ conditioned on the target domain code $c$, i.e., $\hat{x} = G(x, c)$.
In [24], the CycleGAN and StarGAN training schemes are compared.
Concept of CycleGAN training. | Concept of StarGAN training. |
---|---|
Adversarial losses: the generators $G_{X \to Y}$ and $G_{Y \to X}$ try to fool the discriminators $D_Y$ and $D_X$, while $D_Y$ and $D_X$ try to avoid being fooled. | Adversarial loss: the generator $G$ takes an acoustic feature $x$ and a target attribute label $c$ as input; $G$ tries to fool the discriminator $D$, while $D$ tries to avoid being fooled. |
Cycle-consistency loss: guarantees that $G_{Y \to X}(G_{X \to Y}(x))$ and $G_{X \to Y}(G_{Y \to X}(y))$ preserve the linguistic information of the input speech, i.e., it encourages $G_{Y \to X}(G_{X \to Y}(x)) \approx x$ and $G_{X \to Y}(G_{Y \to X}(y)) \approx y$. | Domain classification loss: defined for the classifier $C$ and the generator $G$; it encourages $C$ to correctly classify real data and $G(x, c)$ as belonging to attribute $c$. |
  | Cycle-consistency loss: encourages $G$ to be a bijection and guarantees that $G(x, c)$ preserves the linguistic information of the input speech. |
  | Identity-mapping loss: ensures that an input to $G$ remains unchanged when it already belongs to the target attribute $c$. |
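Roughly, the StarGAN-VC generator objective combines these terms as a weighted sum (the weights $\lambda_{cls}$, $\lambda_{cyc}$, $\lambda_{id}$ are hyperparameters; see [24] for the exact definition of each term):

$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{id}\,\mathcal{L}_{id}$$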
In [25], a source-and-target conditional adversarial loss is developed.
- (a) In the classification loss, the generator prefers to generate classifiable (i.e., far from the decision boundary) data.
- (b) In the target conditional adversarial loss, the generator needs to simultaneously handle hard negative samples (e.g., conversion within the same speaker) and easy negative samples (e.g., conversion between completely different speakers).
- (c) The proposed source-and-target conditional adversarial loss brings all the converted data close to the target data in both a source-wise and a target-wise manner. This resolves the unfair training condition of the target conditional adversarial loss in (b) and allows all the source-domain data to be converted into target-domain data.
Here, the extra domain code is randomly sampled independently of the real data. Both the generator $G$ and the discriminator $D$ are conditioned on the source code $c'$ in addition to the target code $c$.
The source-and-target conditional generator requires the source code to be available at inference time, which is not required in the conventional StarGAN-VC.
For the objective evaluation, Mel-cepstral distortion (MCD) and modulation spectra distance (MSD) are used.
For the subjective evaluation, a mean opinion score (MOS) test (5: excellent to 1: bad) is conducted.
VAE
The conversion function is reformulated as an autoencoder. The encoder $E$ is designed to be speaker-independent and converts an observed frame $x$ into a speaker-independent latent variable (code) $z = E(x)$. Presumably, $z$ contains information that is irrelevant to the speaker, such as phonetic variation, and is therefore also referred to as the phonetic representation[26].
The decoder $D$ takes a speaker representation $s$ as another latent variable and, together with $z$, reconstructs a speaker-dependent frame $\hat{x} = D(z, s)$.
The VAE framework constrains the latent variable $z$ to be (approximately) Gaussian, i.e., speaker-normalized or speaker-independent, so $z$ can be regarded as carrying the linguistic information.
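As a sketch (using the generic conditional-VAE objective rather than the exact formulation of [26]), training maximizes, for each frame $x$ with speaker representation $s$, the evidence lower bound

$$\mathcal{L}(\theta, \phi; x, s) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z, s)\big] - D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big), \qquad p(z) = \mathcal{N}(0, I)$$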
In [27], CycleVAE-based VC is proposed. CycleVAE is capable of recycling the converted spectra back into the system, so that the conversion flow is indirectly considered in the parameter optimization.
Conditional VAEs (CVAEs) are an extended version of VAEs, with the only difference being that the encoder and decoder networks can take an auxiliary variable $c$ (e.g., an attribute class label) as an additional input[28].
Regular CVAEs impose no restrictions on the manner in which the encoder and decoder may use the attribute class label $c$. Hence, the encoder and decoder are free to ignore $c$ by finding distributions satisfying $q(z \mid x, c) = q(z \mid x)$ and $p(x \mid z, c) = p(x \mid z)$. This can occur, for instance, when the encoder and decoder have sufficient capacity to reconstruct any data without using $c$. To avoid such situations, ACVAE[29] introduces an information-theoretic regularization[30] from InfoGAN that encourages the decoder output to be correlated as strongly as possible with $c$. ACVAE stands for VAE with an auxiliary classifier (AC), as shown below.
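A sketch of the regularizer in the InfoGAN style of [30] (not necessarily the exact expression used in [29]): the mutual information between $c$ and the decoder output $\hat{x}$ is lower-bounded with an auxiliary classifier $r(c \mid \hat{x})$, and this bound is added to the CVAE training objective:

$$I(c; \hat{x}) \;\ge\; \mathbb{E}_{c, \hat{x}}\big[\log r(c \mid \hat{x})\big] + H(c)$$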
In [31], a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) is proposed. The model explicitly considers a VC objective when building the speech model. The speech model is built with a CVAE and improved with a WGAN by modifying the loss function.
References
- Mimura, M., Sakai, S., & Kawahara, T. (2017). Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks. 134–140.
- Keskin, G., Lee, T., Stephenson, C., & O. H. E. (2019). Measuring the effectiveness of voice conversion on speaker identification and automatic speech recognition systems.
- Zhao, G., Ding, S., & Gutierrez-Osuna, R. (n.d.). Foreign accent conversion by synthesizing speech from phonetic posteriorgrams. 2–6.
- Gao, J., Chakraborty, D., Tembine, H., & Olaleye, O. (2019). Nonparallel emotional speech conversion. 2858–2862.
- Seshadri, S., Juvela, L., Yamagishi, J., Rasanen, O., & Alku, P. (n.d.). Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion.
- Patel, M., Parmar, M., Doshi, S., Shah, N., & Patil, H. A. (2019). Novel Inception-GAN for whisper-to-normal speech conversion. 7–9.
- Chen, L., Lee, H., & Tsao, Y. (n.d.). Generative adversarial networks for unpaired voice transformation on impaired speech. 2–6.
- Biadsy, F., Weiss, R. J., Moreno, P. J., Kanevsky, D., & Jia, Y. (n.d.). Parrotron: An end-to-end speech-to-speech conversion model and its applications.
- Polyak, A., Wolf, L., Adi, Y., & Taigman, Y. (2020). Unsupervised cross-domain singing voice conversion. http://arxiv.org/abs/2008.02830
- Deng, C., Yu, C., Lu, H., Weng, C., & D. Y. (n.d.). PitchNet: Unsupervised singing voice conversion with pitch adversarial network. 2–6.
- Luo, Y.-J., Hsu, C.-C., Agres, K., & D. H. (2020). Singing voice conversion with disentangled representations of singer and vocal technique using variational autoencoders. ICASSP, 2–6.
- Polyak, A., Wolf, L., Adi, Y., & Taigman, Y. (2020). Unsupervised cross-domain singing voice conversion. http://arxiv.org/abs/2008.02830
- Chou, J., Yeh, C., Lee, H., & Lee, L. (2018). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations.
- Liu, A. T., Hsu, P. C., & Lee, H. Y. (2019). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, 1108–1112. https://doi.org/10.21437/Interspeech.2019-2048
- Sun, L., Li, K., Wang, H., Kang, S., & Meng, H. (2016). Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. IEEE International Conference on Multimedia and Expo (ICME).
- Qian, K., Zhang, Y., Chang, S., Yang, X., & Hasegawa-Johnson, M. (2019). AutoVC: Zero-shot voice style transfer with only autoencoder loss. http://arxiv.org/abs/1905.05879
- Liu, S., Zhong, J., Sun, L., Wu, X., Liu, X., & Meng, H. (2018). Voice conversion across arbitrary speakers based on a single target-speaker utterance. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, 496–500. https://doi.org/10.21437/Interspeech.2018-1504
- Sun, L., Wang, H., Kang, S., Li, K., & Meng, H. (2016). Personalized, cross-lingual TTS using phonetic posteriorgrams. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2016, 322–326. https://doi.org/10.21437/Interspeech.2016-1043
- Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. 1–15. http://arxiv.org/abs/1609.03499
- Chou, J., Yeh, C., & Lee, H. (2019). One-shot voice conversion by separating speaker and content representations with instance normalization.
- Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. ICCV 2017. http://openaccess.thecvf.com/content_ICCV_2017/papers/Huang_Arbitrary_Style_Transfer_ICCV_2017_paper.pdf
- Kaneko, T., & Kameoka, H. (2018). CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. European Signal Processing Conference (EUSIPCO) 2018, 2100–2104. https://doi.org/10.23919/EUSIPCO.2018.8553236
- Kaneko, T., Kameoka, H., Tanaka, K., & N. H. (n.d.). CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion.
- Kameoka, H., Kaneko, T., Tanaka, K., & Hojo, N. (2019). StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks. 2018 IEEE Spoken Language Technology Workshop (SLT 2018), 266–273. https://doi.org/10.1109/SLT.2018.8639535
- Kaneko, T., Kameoka, H., Tanaka, K., & Hojo, N. (n.d.). StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion.
- Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., & Wang, H. M. (2017). Voice conversion from non-parallel corpora using variational auto-encoder. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA 2016). https://doi.org/10.1109/APSIPA.2016.7820786
- Tobing, P. L., Wu, Y. C., Hayashi, T., Kobayashi, K., & Toda, T. (2019). Non-parallel voice conversion with cyclic variational autoencoder. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, 674–678. https://doi.org/10.21437/Interspeech.2019-2307
- Kingma, D. P., Rezende, D. J., Mohamed, S., & Welling, M. (2014). Semi-supervised learning with deep generative models. 1–9.
- Kameoka, H., Kaneko, T., Tanaka, K., & Hojo, N. (2019). ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(9), 1432–1443. https://doi.org/10.1109/TASLP.2019.2917232
- Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 2180–2188.
- Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., & Wang, H. M. (2017). Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, 3364–3368. https://doi.org/10.21437/Interspeech.2017-63