Designing CNN filter (convolution kernel) shapes for music recognition

Original post: http://www.jordipons.me/cnn-filter-shapes-discussion/

We aim to study how deep learning techniques can learn generalizable musical concepts. To do so, we discuss which musical concepts can be fitted under the constraint of a specific CNN filter shape.

Several architectures can be combined to construct deep learning models: feed-forward neural networks, RNNs or CNNs. However, since the goal of our work is to understand which (musical) features deep learning models are learning, CNNs seemed an intuitive choice, given that it is common to feed CNNs with spectrograms. Spectrograms have a meaning in time and in frequency; therefore, the resulting CNN filters will have interpretable dimensions (at least in the first layer): time and frequency. This basic observation motivates the following discussion.

Figure 1. Discussed filter shapes. From left to right: square/rectangular filter, temporal filter and frequency filter.

1. CNN filter shapes discussion

Due to the success of CNNs in the computer vision research field, its literature has significantly influenced the music informatics research (MIR) community. In the image processing literature, small square CNN filters (e.g. 3×3 or 7×7) are common. As a result, MIR researchers tend to use similar filter shapes. However, note that filter dimensions in image processing have a spatial meaning, while the filter dimensions for audio spectrograms correspond to time and frequency. Therefore, wider filters may be capable of learning longer temporal dependencies in the audio domain, while taller filters may be capable of learning more spread timbral features.

To encourage researchers to be conscious of the potential impact of choosing one filter shape or another, three examples and a use case are discussed below. Throughout this post we assume the spectrogram dimensions to be M-by-N, the filter dimensions to be m-by-n and the feature map dimensions to be M’-by-N’, where M, m and M’ stand for the number of frequency bins and N, n and N’ for the number of time frames:

  • Square/rectangular filters (m-by-n filters) are capable of learning time and frequency features at the same time. This kind of filter is one of the most used in the music technology literature. Such filters can learn different musical aspects depending on how m and n are set. For example, a bass or a kick could be well modeled with a small filter (m << M and n << N, representing a sub-band for a short time) because these instruments are sufficiently characterized by the lower bands of the spectrum and the temporal evolution of a bass note or a kick is not so long. An interesting interpretation of such small filters is that they can be considered pitch invariant to some extent.
    Note that the convolution happens in both domains (time and frequency); therefore, the frequency convolution inherent in CNNs acts as a pitch shift. However, such pitch invariance would not hold for instruments with a large pitch range, since the timbre of an instrument changes with its pitch. Depending on the input spectrogram representation (e.g. CQT, MEL or STFT), CNNs might be capable of learning more robust pitch-invariant features. CQT is especially suited for achieving pitch-invariant features since the relative positions of the harmonics remain constant regardless of the f0, which makes the timbre signature less variable across all possible pitches of a harmonic instrument. This contrasts with the timbre representation achieved with STFT, which is f0 dependent. CQT can be thought of as an STFT mapping done by a series of logarithmically spaced averages, spaced in a similar way to how octaves are distributed in frequency. This log-based transform achieves constant inter-harmonic spacings, which might make it easier for CNNs to learn pitch-invariant representations. Finally, note that MEL spectrograms might permit learning features that are more pitch invariant than with STFT, because MEL spectrograms are based on a log-based perceptual scale of pitches. However, in theory, MEL spectrograms are not as good as CQT because their motivation is different: they are designed to map human music perception.
    As another example, cymbals or snare drums, which are broad in frequency with a fixed decay time, could be suitably modeled by setting m = M and n << N. Note that a bass or a kick could also be modeled with this filter; however: (i) the pitch invariance interpretation will not hold because its dimensions (m = M) do not allow the filter to convolve along frequency and therefore pitch will be encoded together with timbre (meaning that, in order to characterize the timbre for the whole pitch range of an instrument, a filter per note is needed), which leads to a less efficient representation; and (ii) most of the weights would be set to zero, wasting part of the representational power of the CNN filter, because most of the relevant information is concentrated in the lower bands of the spectrum.
    As a final example, we want to point out that square/rectangular filters might be capable of modeling musical motifs as well. A musical motif is a succession of (close) notes that occur synchronized with a characteristic rhythmic pattern. Therefore, musical motifs fit under the constraint of being band-limited information (m < M) that lasts a fixed period of time (n < N).
  • Temporal filters (1-by-n): by setting the frequency dimension m to 1, such filters will not be capable of learning frequency features but will be specialized in modeling the temporal dependencies relevant for the task to be learned from the training data. Note that, even though the filters themselves are not learning frequency features, upper layers may be capable of exploiting frequency relations present in the resulting feature map: the frequency interpretation of the M’ dimension of the resulting feature map still holds because the convolution is done bin-wise (m = 1). From the musical perspective, one expects these temporal filters to learn relevant rhythmic/tempo patterns within the analyzed bin.
  • Frequency filters (m-by-1): by setting the time dimension n to 1, such filters will not be capable of learning temporal features but will be specialized in modeling the frequency features relevant for the task to be learned from the training data. Similarly to the temporal filters, upper layers can still find some temporal dependencies in the resulting feature map, since the temporal interpretation of the N’ dimension still holds because the convolution is done frame-wise (n = 1). From the musical perspective, one expects these frequency filters to learn timbre or equalization setups, for example. Moreover, note the resemblance of the frequency filters to the NMF bases that are widely (and successfully) used in MIR. As a final remark, note that the pitch invariance discussion introduced for the m-by-n filters also applies to frequency filters.
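
The relation between the filter shape and the feature map dimensions discussed in the list above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that the post does not state: a "valid" convolution (no padding) with stride 1, so that M' = M - m + 1 and N' = N - n + 1. The spectrogram size (40 bins, 80 frames), the filter sizes and the helper names `feature_map_shape` and `conv2d_valid` are hypothetical choices made for the example.

```python
import numpy as np

def feature_map_shape(M, N, m, n):
    """Shape (M', N') for a 'valid', stride-1 convolution (assumed setup)."""
    if not (1 <= m <= M and 1 <= n <= N):
        raise ValueError("filter must fit inside the spectrogram")
    return M - m + 1, N - n + 1

def conv2d_valid(spec, filt):
    """Naive 'valid' 2D cross-correlation with stride 1, as computed by CNN layers."""
    M, N = spec.shape
    m, n = filt.shape
    Mp, Np = feature_map_shape(M, N, m, n)
    out = np.empty((Mp, Np))
    for i in range(Mp):          # slide along frequency
        for j in range(Np):      # slide along time
            out[i, j] = np.sum(spec[i:i + m, j:j + n] * filt)
    return out

# Hypothetical spectrogram: 40 frequency bins (M) by 80 time frames (N).
M, N = 40, 80
spec = np.random.default_rng(0).random((M, N))

# Hypothetical filter sizes, one per case discussed in the list.
filter_shapes = {
    "small rectangular (m << M, n << N)": (5, 7),
    "full band (m = M, n << N)": (M, 7),
    "temporal (1-by-n)": (1, 7),
    "frequency (m-by-1)": (5, 1),
}

for name, (m, n) in filter_shapes.items():
    fmap = conv2d_valid(spec, np.ones((m, n)))
    print(f"{name}: filter {m}x{n} -> feature map {fmap.shape}")
```

Under these assumptions, the full-band filter yields M' = 1: it has no room to slide along frequency, which is why the pitch invariance interpretation does not apply to it. The temporal filter keeps all M frequency bins (M' = M) and the frequency filter keeps all N time frames (N' = N), so upper layers can still exploit those dimensions.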

To conclude this section, we discuss the results posted by Keunwoo Choi as a case study. They use a 5-layer CNN with square 3-by-3 filters for genre classification. After auralising and visualizing the network filters, they provide an interpretation of the learned CNN filters in every layer:

  • Layer 1: onsets.
  • Layer 2: onsets, bass, harmonics, melody.
  • Layer 3: onsets, melody, kick, percussion.
  • Layer 4: harmonic structures, notes, vertical lines, long horizontal lines.
  • Layer 5: textures, harmo-rhythmic patterns structures.

Note that Keunwoo Choi's observations are in agreement with the discussion presented above. As a result of using small square 3-by-3 filters, the lower layers of the deep CNN learn musical concepts that fit under the constraint of being represented in a sub-band for a short time. Moreover, note that deeper layers in the network learn horizontal and vertical lines, which suggests the plausible utility of temporal and frequency filters in CNNs for MIR.

As observed in this example, the model needed deep representations (stacked CNN layers) to be able to represent large time-frequency contexts, since it is difficult for the first layers to capture long time dependencies or wide frequency signatures with such small square filters. This highlights the potential of employing temporal and frequency filters: by using these filters in the first layer(s), the depth of the network can be employed for learning other features rather than vertical and horizontal lines.
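
To make the point about depth concrete, the following sketch computes how much time-frequency context a stack of small filters can see. It is a simplification under stated assumptions: stride-1 convolutions and no pooling, so each m-by-n layer adds (m - 1) bins and (n - 1) frames to the receptive field. Pooling layers, which the network in the case study may well use, would enlarge the field faster and are not modeled here. The function name `receptive_field` is hypothetical.

```python
def receptive_field(num_layers, filter_size=(3, 3)):
    """Receptive field (frequency bins, time frames) of stacked conv layers.

    Assumes stride 1 and no pooling (a simplification, not the exact
    architecture of the case study).
    """
    m, n = filter_size
    return 1 + num_layers * (m - 1), 1 + num_layers * (n - 1)

# Context seen by a unit at each depth of a stack of 3-by-3 filters.
for depth in range(1, 6):
    bins, frames = receptive_field(depth)
    print(f"{depth} layer(s) of 3x3: {bins} bins x {frames} frames")

# A single temporal filter (1-by-11) already spans 11 frames in the first layer.
print("1 layer of 1x11:", receptive_field(1, filter_size=(1, 11)))
```

Under these assumptions, five 3-by-3 layers are needed before a unit covers 11 time frames, whereas one temporal filter of length 11 covers the same time span in the first layer.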

To conclude, we want to remark that these interpretations do not hold only for music, since similar reasoning could be applied to speech or to any audio-related deep learning task.

The next post proposes and assesses some musically motivated architectures that build on the discussion presented here.

