Reposted from: https://www.anandtech.com/show/16640/arm-announces-neoverse-v1-n2-platforms-cpus-cmn700-mesh/7
The CMN-700 Mesh Network: Bigger, More Flexible
It has been five years since we last discussed Arm's Coherent Mesh Network, namely the CMN-600 that is in use today. The IP was announced quite some time ago and has been a mainstay of Arm's infrastructure IP ever since, going through several IP revisions, with r2 introducing important changes such as larger caches and CCIX capability.
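Alongside the V1 and N2, Arm is announcing a new generation of the CMN: the CMN-700. It promises major improvements to the operation, scalability, flexibility and performance of Arm's mesh network.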
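Starting with the basic features of the new design, the most important change is that the mesh grows from the original limit of 8x8 nodes (64) to 12x12 nodes (144), which raises the number of CPUs that fit in a single mesh and die.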
Terminology:
RN-F: Fully coherent Request Node -- typically a single CPU core, a CAL containing two CPU cores, or a DSU cluster.
HN-F: Fully coherent Home Node -- typically a block of SLC cache with a snoop filter.
CAL: Component Aggregation Layer -- typically a block that houses two CPU cores connecting to one RN-F port.
In practice, the core count in the mesh grows from 64 to 256, the latter number achievable through 128 RN-F request nodes, each with 2 cores through a CAL (Component Aggregation Layer). Attentive readers may find it odd to see Arm state that the CMN-600 supports at most 64 cores, given that the previously covered Ampere Altra Q80 implements 80 cores. Arm explained that the 64-core limit is through native cores connected to RN-Fs or through CALs, and that it's actually possible to host more cores when you integrate them into the mesh through DSUs (DynamIQ Shared Units). Ampere has never disclosed its mesh layout, but this is the only explanation for how 80 cores could be implemented on the CMN-600.
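The scaling figures above are simple products, and a small sketch makes them easy to check. The function and variable names below are illustrative only and do not come from any Arm tool or document.

```python
# Illustrative arithmetic for the mesh and core limits quoted in the text.
# All names here are hypothetical; only the numbers come from the article.

def mesh_nodes(rows: int, cols: int) -> int:
    """Number of nodes in a rows x cols mesh."""
    return rows * cols

def max_cores(rn_f_ports: int, cores_per_port: int) -> int:
    """Upper bound on cores when every RN-F port hosts cores_per_port cores."""
    return rn_f_ports * cores_per_port

cmn600_nodes = mesh_nodes(8, 8)      # original 8x8 limit
cmn700_nodes = mesh_nodes(12, 12)    # new 12x12 limit
cmn700_cores = max_cores(128, 2)     # 128 RN-Fs, 2 cores each through a CAL

print(cmn600_nodes, cmn700_nodes, cmn700_cores)
```

This only models native cores and CALs. The DSU route that would explain the 80-core Altra is not captured, since the text gives no per-port figure for it.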
Alongside 128 RN-Fs hosting up to 256 cores, the chip hosts up to 128 HN-F home nodes, the nodes in which the SLC (System Level Cache) resides. Arm states that this allows up to 512MB of SLC per die, which means 4MB per node. The CMN-600 is stated to support only 128MB of SLC, which is technically incorrect given that the reference manual says it goes up to 256MB, at 4MB per node across 64 nodes.
In both cases, the SLC figures are a bit extreme and one shouldn’t expect designs with such sizes anytime soon.
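As a quick check of the SLC figures, the maximum capacity is the number of HN-F nodes multiplied by the per-node cache size. The sketch below uses a hypothetical helper name and takes only the node counts and the 4MB per node figure from the text.

```python
# Illustrative check of the maximum SLC capacities quoted in the text.
# slc_capacity_mb is a hypothetical helper, not part of any Arm tooling.

def slc_capacity_mb(hn_f_nodes: int, mb_per_node: int) -> int:
    """Total SLC in MB when every HN-F node carries mb_per_node of cache."""
    return hn_f_nodes * mb_per_node

cmn700_slc_mb = slc_capacity_mb(128, 4)  # CMN-700: 128 HN-F nodes at 4MB each
cmn600_slc_mb = slc_capacity_mb(64, 4)   # CMN-600 per the reference manual

print(cmn700_slc_mb, cmn600_slc_mb)
```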
Current designs such as the Graviton2 and the Ampere Altra Q80 use only 32MB of SLC. The reason is something that has not been discussed before: beyond the actual SLC capacity, the HN-F nodes also contain snoop filter caches, which have particularly high size requirements. Arm states that generally the snoop filters need to be at least 1.5x the size of the aggregate exclusive caches of the cores, so in the case of the Altra Q with 80 cores and 1MB of L2 per core, that's at least 120MB of required snoop filter caches on the mesh, alongside the 32MB of SLC. This could well explain why the SLCs are so small compared to what AMD and Intel employ, the former for example using shadow tags of the L2s for coherency (and the IOD having shadow tags of the CCD L3s). It seems Arm's design here is less area efficient.
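The snoop filter estimate can be reproduced in a few lines. This is a minimal sketch that assumes, as the example in the text does, that the L2 is the only exclusive cache counted per core. The 1.5x factor is the figure attributed to Arm, and the names are illustrative.

```python
# Illustrative estimate of the minimum snoop filter capacity.
# Assumption: only the per-core L2 counts as exclusive cache, as in the text.

SNOOP_FILTER_FACTOR = 1.5  # "at least 1.5x" per Arm, as quoted in the article

def min_snoop_filter_mb(cores: int, l2_mb_per_core: float,
                        factor: float = SNOOP_FILTER_FACTOR) -> float:
    """Lower bound on snoop filter capacity in MB."""
    return cores * l2_mb_per_core * factor

altra_sf_mb = min_snoop_filter_mb(80, 1.0)  # Altra Q80: 80 cores, 1MB L2 each
altra_slc_mb = 32                           # SLC on the same mesh
ratio = altra_sf_mb / altra_slc_mb          # snoop filter relative to SLC

print(altra_sf_mb, ratio)
```

Under these assumptions the snoop filters need several times the capacity of the SLC itself, which is consistent with the area-efficiency point made above.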
The number of memory controllers (CHI SN-F nodes) in the CMN-700 grows from 16 to 40, as Arm expects the newest designs to use hybrid memory system architectures.
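Finally, the number of CCIX ports is increased from 4 to 32, which is also critical for the chiplet designs that are expected to be deployed.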
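On memory capabilities, as mentioned, Arm expects hybrid designs that not only increase the number of DDR controllers but can also integrate HBM memory. SiPearl's Rhea is confirmed to use 4x HBM2E and 4-6 DDR5 controllers. The CMN-700 can likewise handle the memory arrangement and manage bandwidth and traffic in heterogeneous memory architectures.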
TODO