Overview
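Taking NVIDIA's GeForce 6800 as an example, we take a closer look at the architecture of a modern GPU. Since its founding in 1993, NVIDIA has become one of the largest GPU manufacturers (alongside ATI) and has released important chips such as the GeForce 256 and the GeForce 3. The GeForce 6800 was released in 2004 and belongs to the GeForce 6 series, NVIDIA's sixth generation of graphics chipsets and the fourth generation to feature programmability (more on that later). The following figure shows a schematic of the GeForce 6800 and its functional units.
Figure 13: Schematic of the GeForce 6800
You can already see how each functional unit corresponds to a stage of the graphics pipeline. We start with six parallel vertex processors, which receive data from the host (CPU) and perform operations such as transformation and lighting.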
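Next, the output goes to the triangle setup stage, which is responsible for primitive assembly, culling, and clipping, and then to the rasterizer, which generates fragments. The GeForce 6800 has an additional z-cull unit that performs an early fragment visibility check based on depth, further improving efficiency. We then move on to the 16 fragment processors, which operate in 4 parallel units and compute the output color of each fragment. The fragment crossbar is a linking element that is mainly responsible for directing output pixels to any available pixel engine (also called ROP, short for Raster Operator), thereby avoiding pipeline stalls. The 16 pixel engines are the final stage of processing; they perform operations such as alpha blending and depth testing before sending the final pixel to the frame buffer.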
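How the GPU Fits into the Overall Computer System
In a modern computer system, the CPU communicates with the GPU through a graphics connector such as a PCI Express or AGP slot on the motherboard. Because the graphics connector is responsible for transferring all command, texture, and vertex data from the CPU to the GPU, the bus technology has evolved alongside GPUs over the past few years. The original AGP slot ran at 66 MHz and was 32 bits wide, giving a transfer rate of 264 MB/sec. AGP 2x, 4x, and 8x followed, each doubling the available bandwidth, until finally the PCI Express standard was introduced in 2004, with a maximum theoretical bandwidth of 4 GB/sec simultaneously available to and from the GPU. (Your mileage may vary; currently available motherboard chipsets fall somewhat below this limit, at around 3.2 GB/sec or less.) It is important to note the vast difference between the GPU's memory interface bandwidth and the bandwidth in other parts of the system, as shown in the table.
Available memory bandwidth in different parts of the computer system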
Component | Bandwidth |
---|---|
GPU Memory Interface | 35 GB/sec |
PCI Express Bus (x16) | 8 GB/sec |
CPU Memory Interface (800 MHz Front-Side Bus) | 6.4 GB/sec |
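The bandwidth available inside the GPU is very large, so algorithms that run on the GPU can take advantage of it to achieve performance improvements.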
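The AGP figures quoted above follow from simple arithmetic: clock rate times bus width times the transfer multiplier. The sketch below only reproduces that calculation; the function name and constants are illustrative and not taken from any specification document.

```python
# Peak bus bandwidth = clock rate x bus width x transfers per clock.
AGP_CLOCK_MHZ = 66     # base AGP clock, as quoted in the text
AGP_WIDTH_BITS = 32    # AGP bus width, as quoted in the text


def agp_bandwidth_mb_per_s(multiplier: int = 1) -> float:
    """Peak AGP transfer rate in MB/sec for AGP 1x, 2x, 4x, or 8x."""
    bytes_per_transfer = AGP_WIDTH_BITS / 8
    return AGP_CLOCK_MHZ * bytes_per_transfer * multiplier


for mult in (1, 2, 4, 8):
    print(f"AGP {mult}x: {agp_bandwidth_mb_per_s(mult):.0f} MB/sec")

# Ratio between the GPU memory interface and the PCI Express x16 bus,
# using the values from the table (35 GB/sec and 8 GB/sec).
print(f"GPU memory vs. PCI Express: {35 / 8:.1f}x")
```

Each AGP generation doubles the previous figure, and the GPU's own memory interface is still several times faster than the bus that feeds it.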
In Detail
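While most parts of the GPU are fixed-function units, the vertex and fragment processors of the GeForce 6800 offer programmability, which was first introduced to the GeForce chipset line with the GeForce 3 (2001).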
Vertex Processor
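The vertex processors are the programmable units responsible for all vertex transformations and attribute calculations. They operate on 4-dimensional data vectors corresponding to the aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence 128 bits per register). Instructions are 123 bits long and are stored in the instruction RAM.
The data path of a vertex processor consists of:
- A multiply-add unit for 4-dimensional vectors
- A scalar special function unit
- A texture unit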
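To make the role of the four-wide multiply-add unit concrete, here is a minimal sketch of a homogeneous vertex transform expressed as four successive MAD operations, one per matrix column. The helper names (`mad4`, `transform`) and the column-major layout are assumptions chosen for illustration; they do not describe the actual hardware instruction encoding.

```python
from typing import List, Sequence

Vec4 = List[float]


def mad4(a: Sequence[float], s: float, c: Sequence[float]) -> Vec4:
    """Four-wide multiply-add: a * s + c, component by component."""
    return [x * s + y for x, y in zip(a, c)]


def transform(columns: Sequence[Sequence[float]], v: Sequence[float]) -> Vec4:
    """Multiply a 4x4 matrix (given as four column vectors) by the 4-vector v."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for column, component in zip(columns, v):
        acc = mad4(column, component, acc)  # one MAD per matrix column
    return acc


# Translation by (2, 3, 4) in homogeneous coordinates, stored column by column.
translate = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [2.0, 3.0, 4.0, 1.0],
]

print(transform(translate, [1.0, 1.0, 1.0, 1.0]))  # [3.0, 4.0, 5.0, 1.0]
```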
Instruction set
The main instructions include:
Registers
Fragment Processor
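The GeForce 6800 has 16 fragment processors. They are grouped into 4 larger units, each of which operates simultaneously on 4 fragments (a so-called quad). They can take position, color, depth, fog, and other arbitrary 4-dimensional attributes as input.
The data path consists of:
- An interpolation block for attributes
- 2 vector math (shader) units, each with slightly different functionality
- A fragment texture unit
Superscalarity
A fragment processor works with 4-vectors (a vector-oriented instruction set), but sometimes the components of a vector need to be treated separately (for example, color and alpha). The fragment processor therefore supports co-issue, which means splitting a vector into two parts and executing a different operation on each part in the same clock. It supports 3-1 and 2-2 splits (2-2 co-issue was not possible before). It also features dual issue, which means executing different operations on the 2 vector math units in the same clock.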
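The two co-issue splits described above can be sketched in a few lines. This is only a functional model: `coissue_3_1` and `coissue_2_2` are hypothetical names, and the code says nothing about how the hardware schedules the two operations.

```python
from typing import Callable, List, Sequence

Op = Callable[[float], float]


def coissue_3_1(vec: Sequence[float], rgb_op: Op, alpha_op: Op) -> List[float]:
    """3-1 split: one operation on RGB, a different one on alpha."""
    r, g, b, a = vec
    return [rgb_op(r), rgb_op(g), rgb_op(b), alpha_op(a)]


def coissue_2_2(vec: Sequence[float], rg_op: Op, ba_op: Op) -> List[float]:
    """2-2 split: one operation on red-green, a different one on blue-alpha."""
    r, g, b, a = vec
    return [rg_op(r), rg_op(g), ba_op(b), ba_op(a)]


color = [0.2, 0.4, 0.6, 0.5]

# Brighten the color channels while inverting alpha, conceptually in one clock.
print(coissue_3_1(color, lambda c: c * 2.0, lambda a: 1.0 - a))

# Scale red-green and pass blue-alpha through unchanged.
print(coissue_2_2(color, lambda c: c * 2.0, lambda c: c))
```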
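Texture Unit
The texture unit is a floating-point texture processor that fetches and filters texture data. It is connected to a level 1 texture cache (which stores parts of the textures in use).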
Shader units 1 and 2
Each shader unit is limited in its capabilities; used together, they offer full functionality.
Shader Unit 1
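Green: A crossbar that distributes the input coming either from the rasterizer or from the loopback
Red: Interpolators
Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)
Cyan: MUL channels
Orange: A unit for texture operations (not the fragment texture unit)
The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector plus a special function, a special function plus a texture operation, or 2 MULs. The output of the special function unit can be fed into the MUL channels. The texture unit gets its input from the MUL unit and performs LOD (level of detail) calculations before passing the data to the actual fragment texture unit. The fragment texture unit then performs the actual sampling and writes the data into registers for the second shader unit. The shader unit can also simply pass data through.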
Shader Unit 2
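Red: A crossbar
Cyan: 4 MUL channels
Gray: 4 ADD channels
Yellow: 1 special function unit
The crossbar splits the input into 5 channels (4 components, with 1 channel left free). The ADD units are additionally interconnected, allowing advanced operations such as a dot product in one clock. Again, the shader unit can handle 2 independent operations per cycle, or it can simply pass data through. If no special function is used, the MAD unit can perform up to 2 operations from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.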
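As a rough illustration of why interconnected ADD channels make a dot product possible in a single pass, the sketch below computes a 4-component dot product as four parallel multiplies followed by a small adder tree. The exact wiring of the ADD channels is not given in the text, so the tree shape here is an assumption.

```python
from typing import Sequence


def dp4(a: Sequence[float], b: Sequence[float]) -> float:
    """4-component dot product: 4 multiplies feeding chained adds."""
    products = [x * y for x, y in zip(a, b)]   # the 4 MUL channels
    low = products[0] + products[1]            # first pair of ADD channels
    high = products[2] + products[3]
    return low + high                          # final, interconnected ADD


print(dp4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))  # 70.0
```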
Instruction set
Some notable instructions of the vertex processor include:
Registers in fragment processor instructions can be modified:
- Negate the register value
- Take the absolute value of the register
- Mask destination register components
- The fragment processors can perform operations with 16- or 32-bit floating-point precision (e.g. the fog unit uses only 16-bit precision for its calculations since that is sufficient)
- The quads operate as SIMD units
- They use VLIW
- They run up to 100s of threads to hide texture fetch latency (~256 per quad)
- A fragment processor can perform up to 8 operations per cycle / 4 math operations if there's a texture fetch in shader 1
- The fragment processors have a two-level texture cache
- The fog unit can perform fog blending on the final pass without performance penalty. It is implemented with fixed-point precision since that's sufficient for fog and saves performance. The equation:
out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
- There's support for multiple render targets: the pixel processor can output to up to four separate buffers (4x4 values, color + depth)
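The fog blending equation from the list above translates directly into code. The sketch uses ordinary Python floats rather than the fixed-point arithmetic the text attributes to the hardware, and the function name `fog_blend` is illustrative.

```python
from typing import List, Sequence


def fog_blend(src_color: Sequence[float],
              fog_color: Sequence[float],
              fog_fraction: float) -> List[float]:
    """out = FogColor * fogFraction + SrcColor * (1 - fogFraction), per channel."""
    return [f * fog_fraction + s * (1.0 - fog_fraction)
            for s, f in zip(src_color, fog_color)]


red = [1.0, 0.0, 0.0]
grey_fog = [0.5, 0.5, 0.5]

print(fog_blend(red, grey_fog, 0.0))  # no fog: the source color is unchanged
print(fog_blend(red, grey_fog, 0.5))  # halfway between source and fog color
print(fog_blend(red, grey_fog, 1.0))  # fully fogged: only the fog color remains
```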
Pixel Engine
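Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine is connected to a specific memory partition of the GPU. After lossless color and depth compression, the depth and color units perform depth, color, and stencil operations before writing the final pixel. When activated, the pixel engines also perform multisample antialiasing.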
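A much simplified model of what a pixel engine does with one incoming fragment might look like the following. It assumes a "less than" depth comparison and conventional source-over alpha blending; both are common defaults rather than details stated in the text, and compression, stencil, and multisampling are left out entirely.

```python
from typing import List, Sequence


def pixel_engine(framebuffer: List[List[List[float]]],
                 depthbuffer: List[List[float]],
                 x: int, y: int,
                 color: Sequence[float], alpha: float, depth: float) -> bool:
    """Depth-test one fragment, then alpha-blend it into the frame buffer.

    Returns True if the fragment was written, False if it was rejected.
    """
    if depth >= depthbuffer[y][x]:       # assumed depth function: "less than"
        return False
    dst = framebuffer[y][x]
    framebuffer[y][x] = [alpha * s + (1.0 - alpha) * d
                         for s, d in zip(color, dst)]
    depthbuffer[y][x] = depth
    return True


fb = [[[0.0, 0.0, 0.0]]]   # a 1x1 black frame buffer
db = [[1.0]]               # depth cleared to the far plane

print(pixel_engine(fb, db, 0, 0, [1.0, 1.0, 1.0], 0.5, 0.5), fb[0][0])
print(pixel_engine(fb, db, 0, 0, [1.0, 0.0, 0.0], 1.0, 0.75), fb[0][0])
```

The second fragment lies behind the first one, so it fails the depth test and leaves both buffers untouched.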
Memory
"The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit."
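The 35 GB/sec limit can be cross-checked against the 256-bit interface width and the 550 MHz memory clock listed in the performance figures. The calculation assumes double-data-rate DRAM (two transfers per clock), which the text does not state explicitly; the names below are illustrative.

```python
MEMORY_CLOCK_MHZ = 550     # memory clock from the performance list
BUS_WIDTH_BITS = 256       # total width of the four memory partitions
TRANSFERS_PER_CLOCK = 2    # assumption: double-data-rate DRAM
PARTITIONS = 4


def memory_bandwidth_gb_per_s(clock_mhz: float, width_bits: int,
                              transfers_per_clock: int) -> float:
    """Peak memory bandwidth in GB/sec (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


total = memory_bandwidth_gb_per_s(MEMORY_CLOCK_MHZ, BUS_WIDTH_BITS,
                                  TRANSFERS_PER_CLOCK)
print(f"Total: {total:.1f} GB/sec")
print(f"Per partition ({BUS_WIDTH_BITS // PARTITIONS} bits wide): "
      f"{total / PARTITIONS:.1f} GB/sec")
```

Under that assumption the result is 35.2 GB/sec, matching the memory bandwidth quoted in the performance list.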
Performance
- 425 MHz internal graphics clock
- 550 MHz memory clock
- 256-MB memory size
- 35.2 GByte/second memory bandwidth
- 600 million vertices/second
- 6.4 billion texels/second
- 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
- 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
- 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
- 64 pixels per clock cycle early z-cull (reject rate)
- 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)
- Up to 120 W energy consumption (the card has two additional power connectors; a power supply of no less than 480 W is recommended)
GPU Features
Fixed-Function Features
Geometry Instancing
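With Shader Model 3.0, the capability is added to send multiple batches of geometry with one Direct3D call, greatly reducing driver overhead in these cases. The hardware feature that enables instancing is vertex stream frequency: the ability to read vertex attributes at a frequency less than once per output vertex, or to loop over a subset of vertices multiple times. Instancing is most useful when the same object is drawn multiple times at different positions, for example, when rendering an army of soldiers or a field of grass.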
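The idea behind vertex stream frequency can be mimicked on the CPU: one stream is read once per vertex, the other only once per instance. The sketch below is a conceptual model only; `draw_instanced` and its parameters are made-up names and do not correspond to Direct3D calls.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def draw_instanced(mesh: Sequence[Point],
                   instance_offsets: Sequence[Point]) -> List[Point]:
    """Emulate one instanced draw: loop over the same mesh once per instance."""
    output = []
    for offset in instance_offsets:      # per-instance stream: advances once per instance
        for vertex in mesh:              # per-vertex stream: advances once per vertex
            output.append(tuple(v + o for v, o in zip(vertex, offset)))
    return output


blade_of_grass = [(0.0, 0.0), (1.0, 0.0)]
positions = [(0.0, 0.0), (10.0, 5.0)]

# A single "call" produces every copy of the mesh.
print(draw_instanced(blade_of_grass, positions))
```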
Early Culling/Clipping
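GeForce 6 Series GPUs are able to cull nonvisible primitives before shading at a higher rate and clip partially visible primitives at full speed. Previous NVIDIA products would cull nonvisible primitives at primitive-setup rates and clip all partially visible primitives at full speed.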
Rasterization
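Like previous NVIDIA products, GeForce 6 Series GPUs are capable of rendering the following objects:
- Point sprites
- Aliased and antialiased lines
- Aliased and antialiased triangles
Multisample antialiasing is also supported, allowing accurate antialiased polygon rendering. Multisample antialiasing supports all rasterization primitives. Multisampling was already supported in previous NVIDIA products, and the 4x multisample pattern was improved for GeForce 6 Series GPUs.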
Z-Cull
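Since the GeForce3, NVIDIA GPUs have had a technology called z-cull, which removes hidden surfaces much faster than conventional rendering. The GeForce 6 Series z-cull unit is the third generation of this technology and is efficient in a wider range of cases. In addition, in cases where the stencil is not being updated, early stencil reject can be used to discard rendering early when the stencil test fails.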
Occlusion Query
Occlusion query is the ability to collect statistics on how many fragments passed or failed the depth test and to report the result back to the host CPU. Occlusion query can be used either while rendering objects or with color and z-write masks turned off, returning depth test status for the objects that would have been rendered, without modifying the contents of the frame buffer. This feature has been available since the GeForce3 was introduced.
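In software terms, an occlusion query with writes masked off amounts to counting depth-test passes without touching any buffer. The following sketch assumes a "less than" depth test and uses a hypothetical `occlusion_query` helper; real applications would issue the query through their graphics API instead.

```python
from typing import List, Sequence, Tuple

Fragment = Tuple[int, int, float]   # (x, y, depth)


def occlusion_query(depthbuffer: Sequence[Sequence[float]],
                    fragments: Sequence[Fragment]) -> int:
    """Count fragments that would pass the depth test; nothing is written."""
    passed = 0
    for x, y, depth in fragments:
        if depth < depthbuffer[y][x]:   # assumed depth function: "less than"
            passed += 1
    return passed


depth = [[0.5, 1.0]]                            # 2x1 depth buffer
bounding_box = [(0, 0, 0.7), (1, 0, 0.7)]       # fragments of a proxy object

visible = occlusion_query(depth, bounding_box)
print(visible)   # 1: only the second fragment is in front of what is stored
```

A result of zero would tell the application that the object is fully hidden and its real geometry can be skipped.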
Texturing
Like previous GPUs, GeForce 6 Series GPUs support bilinear, trilinear, and anisotropic filtering on 2D and cube-map textures of various formats. Three-dimensional textures support bilinear, trilinear, and quad-linear filtering, with and without mipmapping. Here are the new texturing features on GeForce 6 Series GPUs:
- Support for all texture types (2D, cube map, 3D) with fp16x2, fp16x4, fp32x1, fp32x2, and fp32x4 formats
- Support for all filtering modes on fp16x2 and fp16x4 texture formats
- Extended support for non-power-of-two textures to match support for power-of-two textures, specifically:
  - Mipmapping
  - Wrapping and clamping
  - Cube map and 3D textures
Shadow Buffer Support
NVIDIA GPUs support shadow buffering directly. The application first renders the scene from the light source into a separate z-buffer. Then during the lighting phase, it fetches the shadow buffer as a projective texture and performs z-compares of the shadow buffer data against a value corresponding to the distance from the light. If the distance passes the test, it's in light; if not, it's in shadow. NVIDIA GPUs have dedicated transistors to perform four z-compares per pixel (on four neighboring z-values) per clock, and to perform bilinear filtering of the pass/fail data. This more advanced variation of percentage-closer filtering saves many shader instructions compared to GPUs that don't have direct shadow buffer support.
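The four z-compares followed by bilinear filtering of the pass/fail results can be written out as below. This is a software sketch of the idea, not the hardware path: coordinates are given in texel units, the lookup is assumed to stay away from the texture border, and `shadow_factor` is a hypothetical name.

```python
import math
from typing import Sequence


def shadow_factor(shadow_map: Sequence[Sequence[float]],
                  u: float, v: float, light_distance: float) -> float:
    """Return the lit fraction in [0, 1] at texel-space position (u, v)."""
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0

    def lit(x: int, y: int) -> float:
        # One z-compare: 1.0 if the point is not behind the stored occluder depth.
        return 1.0 if light_distance <= shadow_map[y][x] else 0.0

    # Bilinear filtering of the four pass/fail results.
    top = lit(x0, y0) * (1.0 - fx) + lit(x0 + 1, y0) * fx
    bottom = lit(x0, y0 + 1) * (1.0 - fx) + lit(x0 + 1, y0 + 1) * fx
    return top * (1.0 - fy) + bottom * fy


# Left column: nothing blocks the light (depth 1.0); right column: occluder at 0.2.
shadow_map = [[1.0, 0.2],
              [1.0, 0.2]]

print(shadow_factor(shadow_map, 0.25, 0.5, 0.5))  # 0.75: mostly lit, soft edge
```

Filtering the comparison results, rather than the depth values themselves, is what produces the soft transition at shadow edges.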
High-Dynamic-Range Blending Using fp16 Surfaces, Texture Filtering, and Blending (HDR)
GeForce 6 Series GPUs allow for fp16x4 (four components, each represented by a 16-bit float) filtered textures in the pixel shaders; they also allow performing all alpha-blending operations on fp16x4 filtered surfaces. This permits intermediate rendered buffers at a much higher precision and range, enabling high-dynamic-range rendering, motion blur, and many other effects. In addition, it is possible to specify a separate blending function for color and alpha values. (The lowest-end member of the GeForce 6 Series family, the GeForce 6200 TC, does not support floating-point blending or floating-point texture filtering because of its lower memory bandwidth, as well as to save area on the chip.)
Vertex Processor
- Increased instruction count. The total instruction count is now 512 static instructions and 65,536 dynamic instructions. The static instruction count represents the number of instructions in a program as it is compiled. The dynamic instruction count represents the number of instructions actually executed. In practice, the dynamic count can be much higher than the static count due to looping and subroutine calls.
- More temporary registers. Up to 32 four-wide temporary registers can be used in a vertex program.
- Support for instancing. This enhancement was described earlier.
- Dynamic flow control. Branching and looping are now part of the shader model. On the GeForce 6 Series vertex engine, branching and looping have minimal overhead of just two cycles. Also, each vertex can take its own branches without being grouped in the way pixel shader branches are. So as branches diverge, the GeForce 6 Series vertex processor still operates efficiently.
- Vertex texturing. Textures can now be fetched in a vertex program, although only nearest-neighbor filtering is supported in hardware. More advanced filters can of course be implemented in the vertex program. Up to four unique textures can be accessed in a vertex program, although each texture can be accessed multiple times. Vertex textures generate latency for fetching data, unlike true constant reads. Therefore, the best way to use vertex textures is to do a texture fetch and follow it with arithmetic operations to hide the latency before using the result of the texture fetch.
Each vertex engine is capable of simultaneously performing a four-wide SIMD MAD (multiply-add) instruction and a scalar special function per clock cycle. Special function instructions include:
- Exponential functions: EXP, EXPP, LIT, LOG, LOGP
- Reciprocal instructions: RCP, RSQ
- Trigonometric functions: SIN, COS
Fragment Processor
- Increased instruction count. The total instruction count is now 65,535 static instructions and 65,535 dynamic instructions. There are limitations on how long the operating system will wait while the shader finishes working, so a long shader program working on a full screen of pixels may time-out. This makes it important to carefully consider the shader length and number of fragments rendered in one draw call. In practice, the number of instructions exposed by the driver tends to be smaller, because the number of instructions can expand as code is translated from Direct3D pixel shaders or OpenGL fragment programs to native hardware instructions.
- Multiple render targets. The fragment processor can output to up to four separate color buffers, along with a depth value. All four separate color buffers must be the same format and size. MRTs can be particularly useful when operating on scalar data, because up to 16 scalar values can be written out in a single pass by the fragment processor. Sample uses of MRTs include particle physics, where positions and velocities are computed simultaneously, and similar GPGPU algorithms. Deferred shading is another technique that computes and stores multiple four-component floating-point values simultaneously: it computes all material properties and stores them in separate textures. So, for example, the surface normal and the diffuse and specular material properties could be written to textures, and the textures could all be used in subsequent passes when lighting the scene with multiple lights. This is illustrated in Figure 30-8.
- Dynamic flow control (branching). Shader Model 3.0 supports conditional branching and looping, allowing for more flexible shader programs.
- Indexing of attributes. With Shader Model 3.0, an index register can be used to select which attributes to process, allowing for loops to perform the same operation on many different inputs.
- Up to ten full-function attributes. Shader Model 3.0 supports ten full-function attributes/texture coordinates, instead of Shader Model 2.0's eight full-function attributes plus specular color and diffuse color. All ten Shader Model 3.0 attributes are interpolated at full fp32 precision, whereas Shader Model 2.0's diffuse and specular color were interpolated at only 8-bit integer precision.
- Centroid sampling. Shader Model 3.0 allows a per-attribute selection of center sampling, or centroid sampling. Centroid sampling returns a value inside the covered portion of the fragment, instead of at the center, and when used with multisampling, it can remove some artifacts associated with sampling outside the polygon (for example, when calculating diffuse or specular color using texture coordinates, or when using texture atlases).
- Support for fp32 and fp16 internal precision. Fragment programs can support full fp32-precision computations and intermediate storage or partial-precision fp16 computations and intermediate storage.
- 3:1 and 2:2 coissue. Each four-component-wide vector unit is capable of executing two independent instructions in parallel, as shown in Figure 30-9: either one three-wide operation on RGB and a separate operation on alpha, or one two-wide operation on red-green and a separate two-wide operation on blue-alpha. This gives the compiler more opportunity to pack scalar computations into vectors, thereby doing more work in a shorter time.
- Dual issue. Dual issue is similar to coissue, except that the two independent instructions can be executed on different parts of the shader pipeline. This makes the pipeline easier to schedule and, therefore, more efficient.
Achieving Optimal Performance
- Use Z-Culling Aggressively
- Exploit Texture Math When Loading Data
- Use Branching in Fragment Programs Judiciously
- Use fp16 Intermediate Values Wherever Possible
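The last recommendation trades precision for speed, so it helps to see how much precision fp16 actually gives up. The sketch below rounds a value through Python's `struct` module, whose `e` and `f` formats are IEEE 754 half and single precision; whether the GPU rounds in exactly the same way is an assumption, and the helper names are illustrative.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


for x in (0.5, 0.1, 1000.1):
    err16 = abs(to_fp16(x) - x)
    err32 = abs(to_fp32(x) - x)
    print(f"{x}: fp16 error {err16:.2e}, fp32 error {err32:.2e}")
```

Values such as colors and normalized directions usually tolerate the fp16 error, while large texture coordinates or world-space positions may not.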