Voice Conversion

Introduction

VC aims to convert the non-linguistic information of a speech signal while keeping the linguistic content unchanged. It is useful for:

  • multi-speaker text-to-speech
  • speech synthesis
  • speech enhancement[1]
  • data augmentation[2]
  • pronunciation correction
  • accent removal[3]
    • combining the voice quality of a non-native speaker with the pronunciation patterns of a native speaker
    • can be used in language learning
  • speaking style conversion
    • emotion[4]
    • normal to Lombard [5]
    • whisper to normal[6]
  • improving speech intelligibility[7][8]
    • for surgical patients who have had parts of their articulators removed
    • improving intelligibility and naturalness of the speech of deaf / hearing-impaired speakers
  • singing voice conversion[9][10][11][12]

Categories

  • Supervised VC with parallel data
  • Unsupervised VC without parallel data
    • Feature disentanglement: e.g., translating the speech into phoneme posterior sequences with an ASR system and then synthesizing speech with a target-domain synthesizer
    • Direct transformation: utilizing deep generative models (GAN/VAE)

The encoder encodes the source x into a latent representation \textrm{enc}(x). The latent representation \textrm{enc}(x) should contain only the linguistic content, so that the decoder can reconstruct the source when given the speaker identity, while classifier-1 cannot distinguish the speakers from \textrm{enc}(x)[13].

The encoder-decoder architecture. The classifier guides the model to learn to disentangle features.

So feature disentanglement means separating the linguistic (phonetic) content from the speaker identity. The objective of the classifier or discriminator is to adversarially guide the model to learn disentangled features. But how is the linguistic content maintained? One way is to use a content encoder pre-trained on an ASR task[14][15][16][17]. For the speaker identity, one can use a one-hot vector per speaker or a speaker embedding (i-vector, d-vector, x-vector, etc.).
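A minimal PyTorch-style sketch of this adversarial disentanglement objective is given below. The network sizes, one-hot speaker conditioning and loss weighting are illustrative assumptions, not the exact setup of [13]:

```python
# Toy sketch of adversarial feature disentanglement (not the exact setup of [13]).
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, latent_dim, n_speakers = 80, 128, 10
encoder = nn.GRU(feat_dim, latent_dim, batch_first=True)              # x -> enc(x)
decoder = nn.GRU(latent_dim + n_speakers, feat_dim, batch_first=True)
classifier = nn.Linear(latent_dim, n_speakers)                         # classifier-1

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_cls = torch.optim.Adam(classifier.parameters(), lr=1e-4)

def train_step(x, spk_id):
    """x: (B, T, feat_dim) acoustic frames; spk_id: (B,) integer speaker labels."""
    z, _ = encoder(x)                                                  # latent content code enc(x)
    spk = F.one_hot(spk_id, n_speakers).float()
    spk = spk.unsqueeze(1).expand(-1, z.size(1), -1)                   # tile speaker id over time
    x_hat, _ = decoder(torch.cat([z, spk], dim=-1))

    # 1) classifier step: learn to recognize the speaker from enc(x)
    cls_loss = F.cross_entropy(classifier(z.detach()).mean(dim=1), spk_id)
    opt_cls.zero_grad(); cls_loss.backward(); opt_cls.step()

    # 2) autoencoder step: reconstruct x given the speaker id while fooling the
    #    classifier, which pushes enc(x) to drop speaker information
    opt_ae.zero_grad()
    recon = F.l1_loss(x_hat, x)
    adv = -F.cross_entropy(classifier(z).mean(dim=1), spk_id)
    (recon + 0.1 * adv).backward()
    opt_ae.step()
    return recon.item(), cls_loss.item()
```

At conversion time, the source enc(x) would simply be paired with the target speaker's code before decoding.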

Feature Disentanglement

The schematic architecture of voice conversion network. We employ an encoder E, a domain confusion network C and a conditional decoder D.
  • ASR pre-trained content encoder
  • Instance normalization

PPG-based methods

A phonetic posteriorgram (PPG) is obtained from a speaker-independent automatic speech recognition (SI-ASR) system[15][18]. PPGs represent the articulation of speech sounds in a speaker-normalized space and thus capture the spoken content in a speaker-independent way.

A PPG is a time-versus-class matrix representing the posterior probabilities of each phonetic class for each specific time frame of one utterance. A phonetic class may refer to a word, a phone or a senone.
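Concretely, a PPG is just the matrix of per-frame posteriors read off an SI-ASR acoustic model. A minimal sketch, where acoustic_model is a hypothetical frame-level model returning senone logits:

```python
import numpy as np

def extract_ppg(frames, acoustic_model):
    """frames: (T, feat_dim) acoustic features of one utterance.
    acoustic_model(frames) -> (T, n_classes) senone/phone logits (hypothetical API).
    Returns the PPG: a (T, n_classes) matrix whose rows are posterior distributions."""
    logits = acoustic_model(frames)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)       # each row sums to 1
```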

Schematic diagram of VC with PPGs. SI stands for speaker-independent. Target speech and source speech do not have any overlapped portion.

As illustrated in the figure above, the proposed approach is divided into three stages: training stage 1, training stage 2 and the conversion stage. The role of the SI-ASR model is to obtain a PPG representation of the input speech. Training stage 2 models the relationship between the PPGs and the Mel-cepstral coefficient (MCEP) features of the target speaker for speech parameter generation. The conversion stage drives the trained DBLSTM model with PPGs of the source speech (obtained from the same SI-ASR) for VC.
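The conversion stage can be summarized with the following pipeline sketch; all helper functions (extract_features, si_asr, dblstm, vocoder) are hypothetical stand-ins for the trained components described above:

```python
def convert(source_wav, extract_features, si_asr, dblstm, vocoder):
    """Conversion stage: source speech -> PPGs -> target-speaker MCEPs -> waveform."""
    feats = extract_features(source_wav)   # acoustic features for the SI-ASR model
    ppgs = si_asr(feats)                   # speaker-independent posteriorgrams, (T, n_classes)
    mceps = dblstm(ppgs)                   # PPG-to-MCEP mapping trained on the target speaker (stage 2)
    return vocoder(mceps)                  # synthesize the converted waveform
```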

In [3], the authors propose a PPG-based system for foreign accent conversion (FAC). They use an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then train a speech synthesizer to map PPGs from the non-native speaker into the corresponding spectral features, which in turn are converted into the audio waveform by a high-quality neural vocoder. At runtime, the synthesizer is driven with the PPGs extracted from a native reference utterance.

Overall workflow of the proposed PPG system.
PPG-to-Mel conversion model.

MSVC model with i-vector

In [17], two different systems are proposed to achieve any-to-any VC:

  • The i-vector-based VC (IVC) system
  • The speaker-encoder-based VC (SEVC) system.

Both systems train a deep bidirectional long short-term memory (DBLSTM) based multi-speaker voice conversion (MSVC) model. The IVC system uses i-vectors to encode speaker identities, while the SEVC system uses learnable speaker embeddings.
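In both variants, the speaker code is a fixed-length vector appended to every input frame of the MSVC model. A minimal sketch of this conditioning (hypothetical dimensions, PyTorch; not the exact architecture of [17]):

```python
import torch
import torch.nn as nn

class MSVCModel(nn.Module):
    """Toy DBLSTM-style conversion model: PPG frames + speaker vector -> MCEP frames."""
    def __init__(self, ppg_dim=144, spk_dim=100, mcep_dim=40, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(ppg_dim + spk_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, mcep_dim)

    def forward(self, ppg, spk_vec):
        # ppg: (B, T, ppg_dim); spk_vec: (B, spk_dim) i-vector (IVC) or learned embedding (SEVC)
        spk = spk_vec.unsqueeze(1).expand(-1, ppg.size(1), -1)   # tile speaker vector over time
        h, _ = self.blstm(torch.cat([ppg, spk], dim=-1))
        return self.out(h)                                       # (B, T, mcep_dim)
```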


Schematic diagram of an any-to-any voice conversion system.

The key observations from the results are as follows:

  • Both the proposed IVC and SEVC systems can achieve VC across a new source-target speaker pair using only one target-speaker utterance. The converted speech has desirable quality and similarity.
  • The IVC system is superior to the SEVC system in terms of the converted speech's quality and similarity.

PitchNet

A singing voice conversion method is proposed in [10]. PitchNet adds an adversarially trained pitch regression network to force the encoder to learn a pitch-invariant phoneme representation, and a separate module that feeds the pitch extracted from the source audio to the decoder network.
PitchNet consists of five parts: an encoder, a decoder, a Look-Up Table (LUT) of singer embedding vectors, a singer classification network and a pitch regression network. The audio waveform is directly fed into the encoder. The output of the encoder, the singer embedding vector retrieved from the LUT and the input pitch are concatenated to condition the WaveNet[19] decoder, which outputs the audio waveform (a sketch of this conditioning follows the figure captions below).

The overall architecture of PitchNet.
The architecture of the singer classification network.
The architecture of the pitch regression network.
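The decoder conditioning is a frame-wise concatenation of the three streams described above; a hedged sketch with hypothetical shapes (PyTorch):

```python
import torch
import torch.nn as nn

n_singers, singer_dim = 12, 64
singer_lut = nn.Embedding(n_singers, singer_dim)   # Look-Up Table of singer embeddings

def decoder_conditioning(content, singer_id, pitch):
    """content: (B, T, C) encoder output; singer_id: (B,) LUT indices;
    pitch: (B, T, 1) F0 extracted from the source audio."""
    singer = singer_lut(singer_id).unsqueeze(1).expand(-1, content.size(1), -1)
    return torch.cat([content, singer, pitch], dim=-1)   # conditions the WaveNet decoder
```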

Instance normalization (IN) for feature disentanglement is applied in [20]. The authors show that simply adding instance normalization without affine transformation to the content encoder E_c can remove speaker information while preserving content information. To further force the speaker encoder to generate a speaker representation, the speaker information is provided to the decoder through an adaptive instance normalization (AdaIN) layer. The idea comes from style transfer in computer vision[21].

Model overview. E_s is the speaker encoder, E_c is the content encoder and D is the decoder. IN is an instance normalization layer without affine transformation. AdaIN represents an adaptive instance normalization layer.
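Both normalization layers are simple to write down; a minimal sketch assuming per-channel statistics over time with feature maps of shape (B, C, T), not the exact implementation of [20]:

```python
import torch

def instance_norm(x, eps=1e-5):
    """IN without affine transformation: per-utterance, per-channel normalization
    over time, which removes global (speaker-like) statistics from x."""
    mu = x.mean(dim=2, keepdim=True)
    sigma = x.std(dim=2, keepdim=True)
    return (x - mu) / (sigma + eps)

def adain(content, gamma, beta, eps=1e-5):
    """AdaIN: normalize the content features, then re-scale/shift them with gamma and
    beta of shape (B, C, 1) predicted from the speaker encoder's output."""
    return gamma * instance_norm(content, eps) + beta
```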

The idea of IN is also applied in [4], where the speech signal is decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion.

Direct Transformation

A straightforward approach is to use a cycle-consistency loss, as in CycleGAN-VC[22][23][1] and StarGAN-VC[24][25].

Overview figures: CycleGAN and StarGAN.

CycleGAN

Adversarial loss
Makes a converted feature G_{X→Y}(x) indistinguishable from a target y, i.e., restricts G_{X→Y}(x) to follow the target distribution.
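Written out for the X → Y direction, the adversarial loss takes the standard GAN form (a sketch consistent with the notation below, not copied verbatim from [22]):

\begin{equation} \begin{split} \mathcal{L}_{adv} &= E_{y∼P_{Data}(y)}[\log D_Y(y)] \\ & + E_{x∼P_{Data}(x)}[\log(1 − D_Y(G_{X→Y}(x)))] \\ \end{split} \end{equation}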

Cycle-consistency loss
Guarantees linguistic consistency between input and output features. The adversarial loss alone does not guarantee that the contextual information of x and G_{X→Y}(x) will be consistent: it only tells us whether G_{X→Y}(x) follows the target data distribution and does not help preserve the contextual information of x. Since infinitely many mappings induce the same output distribution, the adversarial loss cannot guarantee linguistic consistency.
It encourages a bijection, i.e., encourages G_{X→Y} and G_{Y→X} to find (x, y) pairs with the same contextual information.

Forward-inverse mapping: E_{x∼P_{Data}(x)}[\parallel G_{Y→X}(G_{X→Y}(x)) − x\parallel_1]
Inverse-forward mapping: E_{y∼P_{Data}(y)}[\parallel G_{X→Y}(G_{Y→X}(y)) − y\parallel_1]

Identity-mapping loss
Further encourages input preservation: the generator should leave an input unchanged when it already comes from the target domain. CycleGAN-VC2[23] additionally introduces a second adversarial loss (a two-step adversarial loss) applied to the circularly reconstructed features.
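For reference, the identity-mapping loss takes the form (a sketch following the usual CycleGAN-VC formulation):

\begin{equation} \begin{split} \mathcal{L}_{id} &= E_{y∼P_{Data}(y)}[\parallel G_{X→Y}(y) − y\parallel_1] \\ & + E_{x∼P_{Data}(x)}[\parallel G_{Y→X}(x) − x\parallel_1] \\ \end{split} \end{equation}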

One-step adversarial loss vs. two-step adversarial losses (CycleGAN-VC2[23]).

In [5], the Wasserstein distance metric (WGAN loss) with gradient penalty is considered.

StarGAN

For each conversion pair (X, Y), CycleGAN needs a dedicated generator and discriminator; for, say, 100 speakers, a large number of generators and discriminators would be required. StarGAN-VC[24][25] was introduced to solve this problem: all speakers share the same generator and discriminator. The aim of StarGAN-VC is to obtain a single generator G that learns mappings among multiple domains. To achieve this, StarGAN-VC extends CycleGAN-VC to a conditional setting with a domain code (e.g., a speaker identifier). More precisely, StarGAN-VC learns a generator G that converts an input acoustic feature x into an output feature x' conditioned on the target domain code c', i.e., G(x, c') → x'.
In [24], CycleGAN and StarGAN training are compared:

Concept of CycleGAN training:
  • Adversarial losses: generators G and F want to fool the discriminators D_X and D_Y, while D_X and D_Y attempt to avoid being fooled.
  • Cycle-consistency loss: guarantees that G or F preserves the linguistic information of the input speech. Encourages F(G(x)) ≃ x and G(F(y)) ≃ y.

Concept of StarGAN training:
  • Adversarial loss: the generator G takes an acoustic feature and a target attribute label c as input. G wants to fool the discriminator D, and D tries to avoid being fooled.
  • Domain classification loss: for the classifier C and generator G. Encourages C to correctly classify y ∼ p(y | c) and G(x, c) as belonging to attribute c (see the sketch after this list).
  • Cycle-consistency loss: encourages G(x, c) to be a bijection and guarantees that G preserves the linguistic information of the input speech.
  • Identity-mapping loss: ensures that an input to G remains unchanged when it already belongs to the target attribute c'.
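In the notation of the list above, the two domain classification losses can be written as follows; this is a sketch of the standard formulation (the first term trains the classifier C on real data, the second encourages G(x, c) to be classified as c), not copied verbatim from [24]:

\begin{equation} \begin{split} \mathcal{L}_{cls}^{r} &= E_{c∼p(c),\, y∼p(y|c)}[−\log p_C(c \mid y)] \\ \mathcal{L}_{cls}^{f} &= E_{x∼p(x),\, c∼p(c)}[−\log p_C(c \mid G(x, c))] \\ \end{split} \end{equation}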

In [25], a source-and-target conditional adversarial loss is developed.

Comparison of conditional methods in training objectives. "A" and "B" denote the domain codes and "A → B" represents the data converted from "A" to "B." Circle and triangle markers denote real and fake data, respectively.
  • (a) In the classification loss, G prefers to generate classifiable (i.e., far from the decision boundary) data.
  • (b) In the target conditional adversarial loss, D needs to simultaneously handle hard negative samples (e.g., conversion between the same speaker A → A) and easy negative samples (e.g., conversion between completely different speakers B → A).
    \begin{equation} \begin{split} \mathcal{L}_{t-adv} &= E_{(x,c)∼P(x,c)}[\log D(x, c)] \\ & + E_{x∼P(x),c'∼P(c')}[\log(1 −D(G(x, c'), c'))] \\ \end{split} \end{equation}
  • (c) The proposed source-and-target conditional adversarial loss can bring all the converted data close to the target data in both source-wise and target-wise manners. This resolves the unfair training condition in the target conditional adversarial loss (Figure (b)) and allows all the source domain data to be converted into the target domain data.
    \begin{equation} \begin{split} \mathcal{L}_{st-adv} &= E_{(x,c)\sim P(x,c),c'\sim P(c')}[\log D(x, c', c)] \\ & + E_{(x,c)\sim P(x,c),c'\sim P(c')} [\log(1 − D(G(x, c, c'), c, c'))] \\ \end{split} \end{equation}
    where c' \sim P(c') is randomly sampled independently of real data. Both G and D are conditioned on the source code c' in addition to the target code c.
    The source-and-target conditional generator requires the availability of the source code c' in inference, which is not required in the conventional StarGAN-VC.

For the objective evaluation, Mel-cepstral distortion (MCD) and modulation spectra distance (MSD) are used.
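For reference, MCD is typically computed per frame from time-aligned MCEP sequences (with the 0th coefficient excluded) and averaged; a sketch assuming the two sequences are already aligned, e.g. by DTW:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """mcep_ref, mcep_conv: (T, D) aligned MCEP matrices with the 0th (energy)
    coefficient removed. Returns the average MCD in dB over the T frames."""
    diff = mcep_ref - mcep_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```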

Comparison of MCD and MSD among models using different conditional methods in training objectives. The StarGAN-VC2 conditional method in the G network is fixed as modulation-based.

For the subjective evaluation, a mean opinion score (MOS) test (5: excellent to 1: bad) is conducted.

MOS for naturalness with 95% confidence intervals.

VAE

The conversion function is reformulated as an autoencoder. The encoder f_{\phi}(·) is designed to be speaker-independent and converts an observed frame into a speaker-independent latent variable (code) z_n = f_{\phi}(x_n). Presumably, z_n contains information that is irrelevant to the speaker, such as phonetic variations, so z_n is also referred to as the phonetic representation[26].
The decoder f_{\theta}(·) reconstructs speaker-dependent frames from z_n together with a speaker representation y_n as another latent variable: \hat{x}_n = f_{\theta}(z_n, y_n) = f_{\theta}(f_{\phi}(x_n), y_n).
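A minimal sketch of this factorization (PyTorch-style; the dimensions and single-layer encoder/decoder are illustrative assumptions, not the model of [26]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEVC(nn.Module):
    def __init__(self, feat_dim=40, latent_dim=16, spk_dim=8):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * latent_dim)        # f_phi: frame -> (mu, logvar)
        self.dec = nn.Linear(latent_dim + spk_dim, feat_dim)  # f_theta: (z, y) -> frame

    def forward(self, x, y):
        """x: (B, feat_dim) spectral frames; y: (B, spk_dim) speaker representation."""
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        x_hat = self.dec(torch.cat([z, y], dim=-1))
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return x_hat, F.mse_loss(x_hat, x) + kl                   # negative ELBO (sketch)
```

Conversion then keeps z_n from the source frame and swaps y_n for the target speaker's representation.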

Illustration of VAE-based non-parallel spectral conversion. The dashed line means copying. The latent variables z and y are concatenated.

The VAE guarantees that the latent variable z_n is Gaussian, i.e., speaker-normalized or speaker-independent, so it can be regarded as linguistic information.
In [27], CycleVAE-based VC is proposed. CycleVAE recycles the converted spectra back into the system, so that the conversion flow is indirectly considered in the parameter optimization.

Flow of the conventional VAE-based (upper part) and the proposed CycleVAE-based (whole diagram) VC. Converted input features include converted excitation features, such as linearly transformed F0 values. One full cycle includes the estimation of both reconstructed and cyclic reconstructed spectra. The encoder and decoder networks are shared across all cycles.

Conditional VAEs (CVAEs) are an extended version of VAEs with the only difference being that the encoder and decoder networks can take an auxiliary variable c as an additional input[28].
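In equation form, the CVAE is trained by maximizing the conditional ELBO (a sketch following [28]):

\begin{equation} \begin{split} \log p_\theta(x \mid c) &\geq E_{q_\phi(z \mid x, c)}[\log p_\theta(x \mid z, c)] \\ & − D_{KL}(q_\phi(z \mid x, c) \parallel p(z)) \\ \end{split} \end{equation}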
Regular CVAEs impose no restrictions on how the encoder and decoder may use the attribute class label c. Hence, the encoder and decoder are free to ignore c by finding distributions satisfying q_\phi(z|x, c) = q_\phi(z|x) and p_\theta(x|z, c) = p_\theta(x|z). This can occur, for instance, when the encoder and decoder have sufficient capacity to reconstruct any data without using c. To avoid such situations, ACVAE[29] introduces an information-theoretic regularization[30] from InfoGAN to make the decoder output as correlated as possible with c. ACVAE stands for VAE with an auxiliary classifier (AC), as shown below.

Illustration of ACVAE-VC.

In [31], a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) is proposed. The model explicitly considers a VC objective when building the speech model. The speech model is built with a CVAE and improved with WGAN by modifying the loss function.

References


  1. Mimura, M., Sakai, S., & Kawahara, T. (2017). Cross-domain Speech Recognition Using Nonparallel Corpora with Cycle-consistent Adversarial Networks. 134–140.

  2. Keskin, G., Lee, T., Stephenson, C., & O. H. E. (2019). Measuring the Effectiveness of Voice Conversion on Speaker Identification and Automatic Speech Recognition Systems.

  3. Zhao, G., Ding, S., & Gutierrez-Osuna, R. (n.d.). Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams. 2–6.

  4. Gao, J., Chakraborty, D., Tembine, H., & Olaleye, O. (2019). Nonparallel Emotional Speech Conversion. 2858–2862.

  5. Seshadri, S., Juvela, L., Yamagishi, J., Rasanen, O., & Alku, P. (n.d.). Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion.

  6. Patel, M., Parmar, M., Doshi, S., Shah, N., & Patil, H. A. (2019). Novel Inception-GAN for Whisper-to-Normal Speech Conversion. September, 7–9.

  7. Chen, L., Lee, H., & Tsao, Y. (n.d.). Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech. 2–6.

  8. Biadsy, F., Weiss, R. J., Moreno, P. J., Kanevsky, D., & Jia, Y. (n.d.). Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications.

  9. Polyak, A., Wolf, L., Adi, Y., & Taigman, Y. (2020). Unsupervised Cross-Domain Singing Voice Conversion. http://arxiv.org/abs/2008.02830

  10. Deng, C., Yu, C., Lu, H., Weng, C., & D. Y. (n.d.). PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network. 2–6.

  11. Luo, Y.-J., Hsu, C.-C., Agres, K., & D. H. (2020). Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders. ICASSP, 2–6.

  12. Polyak, A., Wolf, L., Adi, Y., & Taigman, Y. (2020). Unsupervised Cross-Domain Singing Voice Conversion. http://arxiv.org/abs/2008.02830

  13. Chou, J., Yeh, C., Lee, H., & Lee, L. (2018). Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations.

  14. Liu, A. T., Hsu, P. C., & Lee, H. Y. (2019). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019-Septe, 1108–1112. https://doi.org/10.21437/Interspeech.2019-2048

  15. Sun, L., Li, K., Wang, H., Kang, S., & Meng, H. (2016). Phonetic Posteriorgrams for Many-to-One Voice Conversion without Parallel Data Training. IEEE International Conference on Multimedia and Expo (ICME).

  16. Qian, K., Zhang, Y., Chang, S., Yang, X., & Hasegawa-Johnson, M. (2019). AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss. http://arxiv.org/abs/1905.05879

  17. Liu, S., Zhong, J., Sun, L., Wu, X., Liu, X., & Meng, H. (2018). Voice conversion across arbitrary speakers based on a single target-speaker utterance. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018-Septe(September), 496–500. https://doi.org/10.21437/Interspeech.2018-1504

  18. Sun, L., Wang, H., Kang, S., Li, K., & Meng, H. (2016). Personalized, cross-lingual TTS using phonetic posteriorgrams. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 08-12-Sept, 322–326. https://doi.org/10.21437/Interspeech.2016-1043

  19. Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. 1–15. http://arxiv.org/abs/1609.03499

  20. Chou, J., Yeh, C., & Lee, H. (2019). One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization.

  21. Huang, X., & Belongie, S. (2017). Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. ICCV 2017. http://openaccess.thecvf.com/content_ICCV_2017/papers/Huang_Arbitrary_Style_Transfer_ICCV_2017_paper.pdf

  22. Kaneko, T., & Kameoka, H. (2018). Cyclegan-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. European Signal Processing Conference, 2018-September(Vcc 2016), 2100–2104. https://doi.org/10.23919/EUSIPCO.2018.8553236

  23. Kaneko, T., Kameoka, H., Tanaka, K., & Hojo, N. (n.d.). CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion.

  24. Kameoka, H., Kaneko, T., Tanaka, K., & Hojo, N. (2019). StarGAN-VC: Non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks. 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings, 266–273. https://doi.org/10.1109/SLT.2018.8639535

  25. Kaneko, T., Kameoka, H., Tanaka, K., & Hojo, N. (n.d.). StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion.

  26. Hsu, C. C., Hwang, H. Te, Wu, Y. C., Tsao, Y., & Wang, H. M. (2017). Voice conversion from non-parallel corpora using variational auto-encoder. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016. https://doi.org/10.1109/APSIPA.2016.7820786

  27. Tobing, P. L., Wu, Y. C., Hayashi, T., Kobayashi, K., & Toda, T. (2019). Non-parallel voice conversion with cyclic variational autoencoder. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019-Septe, 674–678. https://doi.org/10.21437/Interspeech.2019-2307

  28. Kingma, D. P., Rezende, D. J., Mohamed, S., & Welling, M. (2014). Semi-supervised Learning with Deep Generative Models. 1–9.

  29. Kameoka, H., Kaneko, T., Tanaka, K., & Hojo, N. (2019). ACVAE-VC: Non-Parallel Voice Conversion with Auxiliary Classifier Variational Autoencoder. IEEE/ACM Transactions on Audio Speech and Language Processing, 27(9), 1432–1443. https://doi.org/10.1109/TASLP.2019.2917232

  30. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 2180–2188.

  31. Hsu, C. C., Hwang, H. Te, Wu, Y. C., Tsao, Y., & Wang, H. M. (2017). Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017-Augus(2), 3364–3368. https://doi.org/10.21437/Interspeech.2017-63
